Welcome

About this handbook

Objective

A free, open-access digital R reference book for epidemiologists and public health practitioners that is usable offline and addresses common epidemiological tasks with clear text explanations, step-by-step instructions, and best-practice R code examples

Epidemiologists using R must often search Google and read dozens of forum pages to complete common epidemiological data manipulation and visualization tasks. However, field epidemiologists often work in low internet-connectivity environments and have limited technical support. This handbook aims to fill this gap.

How to read this handbook:

  • This is an HTML file which can be viewed offline, and is best viewed with Google Chrome.

  • Search via the search box above the Table of Contents. Ctrl+f will search across the current page.

  • Click the “clipboard” icon in the upper-right of each code chunk to copy it.

Version
The latest version of this handbook can be found at this GitHub repository.

Acknowledgements

Contributors

Editor-in-Chief: Neale Batra ()

Editorial core team:

Authors:

Reviewers:

Advisers

Funding and programmatic support

TEPHINET
EAN

Data sources

outbreaks R package

Inspiration and templates

R4Epis
RECON packages
R4DS book (Hadley)
Bookdown book (Yihui)
Rmarkdown book (Yihui)

Image credits

Logo: CDC Public Image gallery; R Graph Gallery

I About this book

Style and editorial notes

Style

Text style

Package and function names

Package names are written in bold (e.g. dplyr) and functions are written like this: mutate(). A function may be referenced together with its package, either in text or within code, like this: dplyr::mutate()

Types of notes

NOTE: This is a note

TIP: This is a tip.

CAUTION: This is a cautionary note.

DANGER: This is a warning.

tidyverse

This handbook generally uses tidyverse R coding style. Read more here

Code readability

We frequently chose to write code arguments on new lines, in order to allow more understandable comments. As a result, code that could be written like this:

obs %>% 
  group_by(name) %>%                    # group the rows by 'name'
  slice_max(date, n = 1, with_ties = F) # if there's a tie (of date), take the first row

…is often written like this:

obs %>% 
  group_by(name) %>%   # group the rows by 'name'
  slice_max(
    date,              # keep row per group with maximum date value 
    n = 1,             # keep only the single highest row 
    with_ties = F)     # if there's a tie (of date), take the first row

Editorial decisions

Below, we track significant editorial decisions around package and function choice. If you disagree or want to suggest a new tool, please join or start a conversation on our GitHub page.

Table of package, function, and other editorial decisions

Subject  | Considered       | Outcome & date      | Brief rationale
Epiweeks | aweek, lubridate | lubridate, Dec 2020 | consistency, package maintenance prospects

Datasets used

The datasets used in this handbook are described below and will be downloadable:

  • Linelist (…)
  • Aggregated case counts (…)
  • GIS shapefile (…)
  • modeling dataset? (…)

II Data Management

Cleaning data

Overview

This page demonstrates common steps necessary to clean a dataset. It uses a simulated Ebola case linelist, which is used throughout the handbook.

  • Dealing with character case (upper, lower, title, etc.)
  • Factor columns


Preparation

Load packages

pacman::p_load(tidyverse,  # data manipulation and visualization
               janitor,    # data cleaning
               rio,        # importing data
               epikit)     # age_categories() function  

Load data

Import the raw dataset using the import() function from the package rio. (LINK HERE TO IMPORT PAGE)

linelist_raw <- import("linelist_raw.xlsx")
## New names:
## * `` -> ...28

You can view the first 50 rows of the original “raw” dataset below:

Cleaning pipeline

In epidemiological analysis and data processing, cleaning steps are often performed together and sequentially. In R this often manifests as a cleaning “pipeline”, where the raw dataset is passed or “piped” from one cleaning step to another. The chain utilizes dplyr verbs and the magrittr pipe operator (see the handbook pages on dplyr and tidyverse coding style - LINK HERE). The pipe begins with the “raw” data (linelist_raw) and ends with a “clean” dataset (linelist).

In a cleaning pipeline the order of the steps is important. Cleaning steps might include:

  • Importing of data
  • Column names cleaned or changed
  • Rows filtered, added, or de-duplicated
  • Columns selected, added, transformed, or re-ordered
  • Values re-coded, cleaned, or grouped

Column names

Column names are used very often, so they need to have “clean” syntax. We suggest the following:

  • Short names
  • No spaces (replace them with underscores _)
  • No unusual characters (&, #, …)
  • A consistent naming style (e.g. all date columns named like date_onset, date_report, date_death…)

The column names of linelist_raw are below. We can see that some contain spaces, and that different naming patterns are used for dates (‘date onset’ and ‘infection date’).

Also note that in the raw data, the final two column names came from two merged cells with one name. The import() function used that name for the first of the two columns, and assigned the second column the name “…28”, as it was empty (referring to the 28th column).

names(linelist_raw)
##  [1] "row_num"         "case_id"         "generation"      "infection date"  "date onset"      "hosp date"       "date_of_outcome"
##  [8] "outcome"         "gender"          "hospital"        "lon"             "lat"             "infector"        "source"         
## [15] "age"             "wt_kg"           "ht_cm"           "ct_blood"        "age_unit"        "fever"           "chills"         
## [22] "cough"           "aches"           "vomit"           "temp"            "time_admission"  "merged_header"   "...28"
Note: For a column name that includes spaces, surround the name with back-ticks, for example: linelist$`infection date`. On a keyboard, the back-tick (`) is different from the single quotation mark ('), and is sometimes on the same key as the tilde (~).

Automatic column name cleaning

The function clean_names() from the package janitor standardizes column names and makes them unique by doing the following:

  • Converts all names to consist of only underscores, numbers, and letters
  • Transliterates accented characters to ASCII (e.g. German o with umlaut becomes “o”, Spanish “enye” becomes “n”)
  • Capitalization preference can be specified using the case = argument (“snake” is the default; alternatives include “sentence”, “title”, “small_camel”…)
  • Specific name replacements can be designated with the replace = argument (e.g. replace = c(onset = "date_of_onset"))
  • Here is an online vignette
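
As a quick illustration (using a small made-up data frame - the messy column names below are hypothetical):

```r
library(janitor)

# a small made-up data frame with messy column names
df <- data.frame(
  "First Name" = c("a", "b"),
  "dätä.VALUE" = c(1, 2),
  check.names = FALSE      # keep the messy names exactly as written
)

names(clean_names(df))     # spaces/periods become underscores, accents become ASCII
## [1] "first_name" "data_value"
```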

Below, the cleaning pipeline begins by using clean_names() on the raw linelist.

# send the dataset through the function clean_names()
linelist <- linelist_raw %>% 
  janitor::clean_names()

# see the new names
names(linelist)
##  [1] "row_num"         "case_id"         "generation"      "infection_date"  "date_onset"      "hosp_date"       "date_of_outcome"
##  [8] "outcome"         "gender"          "hospital"        "lon"             "lat"             "infector"        "source"         
## [15] "age"             "wt_kg"           "ht_cm"           "ct_blood"        "age_unit"        "fever"           "chills"         
## [22] "cough"           "aches"           "vomit"           "temp"            "time_admission"  "merged_header"   "x28"

NOTE: The column name “…28” was changed to “x28”.

Manual column name cleaning

Re-naming columns manually is often necessary. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style NEW = OLD - the new column name is given before the old column name.

Below, a re-name command is added to the cleaning pipeline:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome)

Now you can see that the column names have been changed:

Rename by column position

You can also rename by column position, instead of column name, for example:

rename(newNameForFirstColumn = 1,
       newNameForSecondColumn = 2)

Empty Excel column names

If you are importing an Excel sheet with a missing column name, then depending on the import function used, R will likely create a column name with a value like “…1” or “…2”. You can clean these names manually by referencing their position number (see above), or their name (linelist_raw$...1).
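
For example, on a made-up data frame mimicking such an import (the new name status below is hypothetical):

```r
library(dplyr)

# made-up data frame mimicking an import in which one column name was missing
df <- data.frame(c("x", "y"), c(1, 2))
names(df) <- c("...1", "case_id")    # "...1" as auto-assigned on import

# rename by name (note the back-ticks), using the style NEW = OLD
df <- df %>% rename(status = `...1`)

names(df)
## [1] "status"  "case_id"
```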

Merged Excel column names

Merged cells in an Excel file are a common occurrence when receiving data from field level. Merged cells can be nice for humans reading the data, but they cause many problems for machines reading the data. R cannot accommodate merged cells.

Remind people doing data entry that human-readable data is not the same as machine-readable data. Strive to train users in the principles of tidy data. If at all possible, try to change procedures so that data arrive in a tidy format without merged cells:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

When using rio’s import() function, the value in a merged cell will be assigned to the first cell and subsequent cells will be empty.

One solution to deal with merged cells is to import the data with the function readWorkbook() from package openxlsx. Set the argument fillMergedCells = TRUE. This gives the value in a merged cell to all cells within the merge range.

linelist_raw <- openxlsx::readWorkbook("linelist_raw.xlsx", fillMergedCells = TRUE)

DANGER: If column names are merged, you will end up with duplicate column names, which you will need to fix manually - R does not work well with duplicate column names! You can re-name them by referencing their position (e.g. column 5), as explained in the section on manual column name cleaning.

Skip import of rows

Sometimes, you may want to avoid importing a row of data (e.g. the row of column names, which is row 1).
You can do this with the skip = argument if using import() from the rio package on a .xlsx or .csv file. Provide the number of rows you want to skip.

linelist_raw <- import("linelist_raw.xlsx", skip = 1)  # does not import header row

Unfortunately, skip = accepts only one integer value, not a range (e.g. “2:10” does not work). To skip the import of specific rows that are not consecutive from the top, consider importing multiple times and combining the pieces with bind_rows() from dplyr. See the example below of skipping only row 2.
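
As a sketch of a simpler alternative (not shown in the original text): import once, then drop specific data rows by number with dplyr’s slice() - a negative index drops that row. The file name is the one used on this page.

```r
library(rio)
library(dplyr)

# import once, then drop only data row 2
linelist_raw <- import("linelist_raw.xlsx") %>% 
  slice(-2)                  # negative index = drop that row

# non-consecutive rows can be dropped too, e.g. data rows 2 and 5:
# slice(-c(2, 5))
```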

Removing a second header row

Your data may contain a second row that is not data, for example a “data dictionary” row (see example below).

This situation can be problematic because it can result in all columns being imported as class “character”. To solve this, you will likely need to import the data twice.

  1. Import the data in order to store the correct column names
  2. Import the data again, skipping the first two rows (header and second rows)
  3. Bind the correct names onto the reduced dataframe

The exact arguments used to bind the correct column names depends on the type of data file (.csv, .tsv, .xlsx, etc.). If using rio’s import() function, understand which function rio uses to import your data, and then give the appropriate argument to skip lines and/or designate the column names. See the handbook page on importing data (LINK) for details on rio.

For Excel files:

# For excel files (remove 2nd row)
linelist_raw_names <- import("linelist_raw.xlsx") %>% names()  # save true column names

# import, skip row 2, assign to col_names =
linelist_raw <- import("linelist_raw.xlsx", skip = 2, col_names = linelist_raw_names) 

For CSV files:

# For csv files
linelist_raw_names <- import("linelist_raw.csv") %>% names() # save true column names

# note argument is 'col.names ='
linelist_raw <- import("linelist_raw.csv", skip = 2, col.names = linelist_raw_names) 

Backup option - changing column names as a separate command

# assign/overwrite headers using the base 'colnames()' function
colnames(linelist_raw) <- linelist_raw_names

Bonus! If you do have a second row that is a data dictionary, you can easily create a proper data dictionary from it using the gather() command from the tidyr package.
source: https://alison.rbind.io/post/2018-02-23-read-multiple-header-rows/

TO DO

library(tidyr)
library(janitor)

# keep just the dictionary row and pivot it to long format: one row per column
linelist_dict <- import("linelist_raw.xlsx") %>% 
  clean_names() %>% 
  head(1) %>%          # the first data row holds the variable descriptions
  gather(variable_name, variable_description)
linelist_dict

Combine two header rows

In some cases, you may want to combine two header rows into one. This command will define the column names as the combination (pasting together) of the existing column names with the value underneath in the first row. Replace “df” with the name of your dataset.

names(df) <- paste(names(df), df[1, ], sep = "_")

Select or re-order columns

CAUTION: This tab may follow from previous tabs.

Often the first step of cleaning data is selecting the columns you want to work with, and setting their order in the dataframe. In a dplyr chain of verbs, this is done with select().

CAUTION: In the examples below, linelist is modified with select() but not over-written. New column names are only displayed for purpose of example.

Here are all the column names in the linelist:

names(linelist)
##  [1] "row_num"              "case_id"              "generation"           "date_infection"       "date_onset"          
##  [6] "date_hospitalisation" "date_outcome"         "outcome"              "gender"               "hospital"            
## [11] "lon"                  "lat"                  "infector"             "source"               "age"                 
## [16] "wt_kg"                "ht_cm"                "ct_blood"             "age_unit"             "fever"               
## [21] "chills"               "cough"                "aches"                "vomit"                "temp"                
## [26] "time_admission"       "merged_header"        "x28"

Select & re-order

Specify the columns you want to keep, in their desired order of appearance:

# linelist dataset is piped through select() command, and names() prints just the column names
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, fever) %>% 
  names() # display the column names
## [1] "case_id"              "date_onset"           "date_hospitalisation" "fever"

Indicate which columns to remove by placing a minus symbol “-” in front of a column name (e.g. select(-outcome)) or in front of a vector of column names (as below); all other columns will be retained. Inside select() you can use operators such as c() to list several columns, : for consecutive columns, ! for negation, & for AND, and | for OR.

linelist %>% 
  select(-c(date_onset, fever:vomit)) %>% # remove onset and all symptom columns
  names()
##  [1] "row_num"              "case_id"              "generation"           "date_infection"       "date_hospitalisation"
##  [6] "date_outcome"         "outcome"              "gender"               "hospital"             "lon"                 
## [11] "lat"                  "infector"             "source"               "age"                  "wt_kg"               
## [16] "ht_cm"                "ct_blood"             "age_unit"             "temp"                 "time_admission"      
## [21] "merged_header"        "x28"

Re-order the columns - use everything() to signify all other columns not specified in the select() command:

# move case_id, date_onset, date_hospitalisation, and gender to beginning
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, gender, everything()) %>% 
  names()
##  [1] "case_id"              "date_onset"           "date_hospitalisation" "gender"               "row_num"             
##  [6] "generation"           "date_infection"       "date_outcome"         "outcome"              "hospital"            
## [11] "lon"                  "lat"                  "infector"             "source"               "age"                 
## [16] "wt_kg"                "ht_cm"                "ct_blood"             "age_unit"             "fever"               
## [21] "chills"               "cough"                "aches"                "vomit"                "temp"                
## [26] "time_admission"       "merged_header"        "x28"

As well as everything() there are several special functions that work within select(), namely:

  • everything() - all other columns not mentioned
  • last_col() - the last column
  • where() - applies a function to all columns and selects those which are TRUE
  • starts_with() - matches to a specified prefix. Example: select(starts_with("date"))
  • ends_with() - matches to a specified suffix. Example: select(ends_with("_end"))
  • contains() - columns containing a character string. Example: select(contains("time"))
  • matches() - applies a regular expression (regex). Example: select(matches("[pt]al"))
  • num_range() - matches a numbered range with a prefix. Example: select(num_range("wk", 1:6)) selects wk1 through wk6
  • any_of() - matches a column if it is named; useful when the name might not exist. Example: select(any_of(c("date_onset", "date_death", "cardiac_arrest")))

Here are examples using contains() and matches():

# select columns containing certain characters
linelist %>% 
  select(contains("date")) %>% 
  names()
## [1] "date_infection"       "date_onset"           "date_hospitalisation" "date_outcome"
# searched for multiple character matches
linelist %>% 
  select(matches("onset|hosp|fev")) %>%   # note the OR symbol "|"
  names()
## [1] "date_onset"           "date_hospitalisation" "hospital"             "fever"
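
And here is a sketch of where(), which selects columns for which a test function returns TRUE - shown on a small made-up data frame, but the same pattern applies to linelist:

```r
library(dplyr)

df <- data.frame(case_id = c("a", "b"), age = c(10, 20), fever = c("yes", "no"))

df %>% 
  select(where(is.numeric)) %>%   # keep columns where is.numeric() returns TRUE
  names()
## [1] "age"
```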

select() as a stand-alone command

select() can also be used as an independent command (not in a pipe chain). In this case, the first argument is the dataframe to operate on.

# Create a new linelist with id and age-related columns
linelist_age <- select(linelist, case_id, contains("age"))

# display the column names
names(linelist_age)
## [1] "case_id"  "age"      "age_unit"

Add to the pipe chain

In the linelist, there are a few columns we do not need: row_num, merged_header, and x28. Remove them by adding a select() command to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28))

Modify class

CAUTION: This tab may follow from previous tabs.

See section on object classes

Often you will need to set the correct class for a column. The most common approach is to use mutate() to define the column as itself, but with a different class. Generally, this looks like this:

# Examples of modifying class
linelist <- linelist %>% 
  mutate(date_var      = as.Date(date_var, format = "%m/%d/%Y"),  # format must match how the raw dates are written
         numeric_var   = as.numeric(numeric_var),
         character_var = as.character(character_var),
         factor_var    = factor(factor_var, levels = c(), labels = c())
         )

Pre-checks and errors

First we run some checks on the classes of important columns.

The class of the “age” column is character. To perform analysis, we need those numbers to be recognized as numeric!

class(linelist$age)
## [1] "character"

The class of the “date_onset” column is also character! To perform analysis, these dates must be recognized as dates!

class(linelist$date_onset)
## [1] "character"

However, if we tried to convert the date_onset column to class Date now, we would get an error. Use table(), sort(), or another method to examine all the values and identify any that differ. For example, in our dataset we can see that one date_onset value was entered in a different format (15th April 2014) than all the other values!

## 
## 15th April 2014      2012-04-21      2012-05-09      2012-05-14      2012-05-27      2012-06-22 
##               1               1               1               1               2               1

Before we can convert “date_onset” to class Date, this value must be fixed to match the format of the others. You can fix the date in the source data, or fix it in the cleaning pipeline via mutate() and recode(). This must be done before the commands that convert to class Date (LINK TO DATE SECTION).

# fix incorrect values                 # old value       # new value
mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15"))

The mutate() line can be read as: “mutate date_onset to equal date_onset recoded so that OLD VALUE is changed to NEW VALUE”. Note that this pattern (OLD = NEW) for recode() is the opposite of most R patterns (new = old). The R development community is working on revising this for recoding.

Especially after converting to class Date, check your data visually or with table() to confirm that the values were converted correctly! For as.Date(), the format = argument is often a source of errors.
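
For example, format = must describe exactly how the raw text is written, using strptime-style codes (%d day, %m month number, %B full month name, %Y 4-digit year). A quick sketch:

```r
# the format string must match the raw data
as.Date("15/04/2014", format = "%d/%m/%Y")
## [1] "2014-04-15"

# a written-out month needs %B (this relies on an English system locale)
as.Date("15 April 2014", format = "%d %B %Y")
```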

Modify multiple columns

You can use the dplyr function across() within mutate() to convert several columns to a new class at once. across() lets you specify which columns a function should apply to. Below, we select the columns for which is.POSIXct() (a date/time class that carries unnecessary timestamps) returns TRUE, and apply the function as.Date() to them, in order to convert them to class Date.

  • Note that within across() we also use the function where().
  • Note that is.POSIXct() is from the package lubridate. Other similar functions (is.character(), is.numeric(), and is.logical()) are from base R.
  • Note that the functions given to across() are written without the empty parentheses ().

linelist <- linelist %>% 
  mutate(across(where(lubridate::is.POSIXct), as.Date))

Below, the described cleaning steps are added to the pipe chain.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 

  
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  ###################################################

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
  
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) 

Add columns and rows

See the tabs below to add columns and rows

Add columns

mutate()

We advise creating new columns with dplyr functions as part of a chain of such verbs (e.g. filter(), mutate(), etc.). If you need a stand-alone command, you can use mutate() or the base R style to create a new column (see below).

The verb mutate() is used to add a new column, or to modify an existing one. Below is an example of creating a new column with mutate(). The syntax is: new_column_name = value or function.

linelist <- linelist %>% 
  mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset))

It is best practice to separate each new column with a comma and new line. Below, some practice columns are created:

linelist <- linelist %>%                       # creating new, or modifying old dataset
  mutate(new_var_dup    = case_id,             # new column = duplicate/copy another column
         new_var_static = 7,                   # new column = all values the same
         new_var_static = new_var_static + 5,  # you can overwrite a column, and it can be a calculation using other variables
         new_var_paste  = stringr::str_glue("{hospital} on ({date_hospitalisation})") # new column = pasting together values from other columns
         ) 

Scroll to the right to see the new columns (first 50 rows):

TIP: The verb transmute() adds new columns just like mutate() but also drops/removes all other columns that you do not mention.
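
A quick sketch: the transmute() call below returns only the two columns it names, dropping all others.

```r
linelist %>% 
  transmute(case_id = case_id,     # columns mentioned are kept...
            days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
  names()                          # ...all other columns are dropped
```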

New columns using base R

To define a new column (or re-define a column) using base R, just use the assignment operator as below. Remember that when using base R you must specify the dataframe before writing the column name (e.g. dataframe$column). Here are two dummy examples:

linelist$old_var <- linelist$old_var + 7
linelist$new_var <- linelist$old_var + linelist$age

Add rows

TO DO

Remember that each column must contain values of only one class (character, numeric, logical, etc.), so adding a row requires some care to maintain this.

linelist <- linelist %>% 
  add_row(row_num = 666, case_id = "abc", generation = 4, `infection date` = as.Date("2020-10-10"), .before = 2)

Use .before = and .after = to control placement: .before = 3 will put the new row before the current 3rd row. The default is to add the row at the end. Columns not specified will be left empty (NA). The new row number may look strange (“…23”), but the row numbers have changed, so if running the command twice, examine/test carefully.

If a value’s class does not match its column, you will see an error like: Error: Can’t combine ..1$infection date and ..2$infection date. (For a date value, remember to wrap the date in the function as.Date(), like as.Date("2020-10-10").)

New columns using grouped values

CAUTION: This tab may follow from previous tabs.

Using mutate() on GROUPED dataframes: https://dplyr.tidyverse.org/reference/mutate.html

Taken from the website above:

“Because mutating expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped mutate:”

starwars %>%
  select(name, mass, species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))

“With the grouped equivalent:”

starwars %>%
  select(name, mass, species) %>%
  group_by(species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))

“The former normalises mass by the global average whereas the latter normalises by the averages within species levels.”

Add to pipe chain

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 

  # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  ###################################################

  # create column: delay to hospitalisation
  mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset))

Recoding values

In linelist, the values in the column “hospital” must be cleaned. There are several different spellings (the word “Hospital” is often missing its “s” and written “Hopital”), and many missing values.

table(linelist$hospital, useNA = "always")
## 
##                      Central Hopital                     Central Hospital                           Hospital A 
##                                   11                                  454                                  289 
##                           Hospital B                     Military Hopital                    Military Hospital 
##                                  289                                   31                                  802 
##                     Mitylira Hopital                    Mitylira Hospital                                Other 
##                                    1                                   82                                  902 
##                         Port Hopital                        Port Hospital St. Mark's Maternity Hospital (SMMH) 
##                                   47                                 1760                                  426 
##   St. Marks Maternity Hopital (SMMH)                                 <NA> 
##                                   11                                 1504

Manual recoding

These tabs demonstrate re-coding values manually by providing specific spellings to be corrected:

  • Using replace() for specific rows
  • Using recode() for entire columns
  • Using base R

replace()

To manually change values for specific rows within a dataframe (from within a pipe chain), use replace() within mutate().
Use a logical condition to specify the rows, for example an ID value of one row. The general syntax is:

mutate(col_to_change = replace(col_to_change, criteria for rows, new value))

In the first example below, the gender value, in the row where id is “2195”, is changed to “Female”.

# Example: change gender of one specific observation to "Female" 
mutate(gender = replace(gender, id == "2195", "Female"))

# Example: change gender of one specific observation to NA 
mutate(gender = replace(gender, id == "2195", NA))

recode()

To change spellings manually, one-by-one, you can use the recode() function within the mutate() function. The code says that the column “hospital” should be defined as the current column “hospital”, but with certain changes (the syntax is OLD = NEW). Don’t forget commas!

linelist <- linelist %>% 
  mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      ))

Now we see the values in the hospital column have been corrected:

table(linelist$hospital, useNA = "always")
## 
##                     Central Hospital                           Hospital A                           Hospital B 
##                                  465                                  289                                  289 
##                    Military Hospital                                Other                        Port Hospital 
##                                  916                                  902                                 1807 
## St. Mark's Maternity Hospital (SMMH)                                 <NA> 
##                                  437                                 1504

TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.

TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks and no space in between ("").
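For instance (a minimal sketch with a hypothetical mini-dataframe, using the replace() approach shown above), blank strings can be converted to true NA values:

```r
library(dplyr)

# hypothetical mini-dataframe in which some hospital entries are blank ("")
df <- data.frame(hospital = c("Port Hospital", "", "Central Hospital", ""))

# convert blank strings ("") to R's true missing value, NA
df <- df %>% 
  mutate(hospital = replace(hospital, hospital == "", NA))

table(df$hospital, useNA = "always")
```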

base R


If necessary, you can make manual changes to a specific value in a dataframe by referencing the row number or case ID. But remember it is better if you can make these changes permanently in the underlying data!

Here is a fake example. It reads as: “Change the value of the dataframe linelist’s column date_onset (for the row where linelist’s column case_id has the value ‘9d4019’) to as.Date("2020-10-24")”.

linelist$date_onset[linelist$case_id == "9d4019"] <- as.Date("2020-10-24")

Recoding by logic


These tabs demonstrate re-coding values in a column using logic and conditions:

  • Using case_when()
  • Using ifelse() and if_else()
  • Using special dplyr recoding functions like:
    • replace_na()
    • na_if()
    • coalesce()

case_when()


If you need to use logic statements to recode values, or want to use operators like %in%, use dplyr’s case_when() instead. If you use case_when() please read the thorough explanation HERE LINK, as there are important differences from recode() in syntax and logic order!

Note that all Right-hand side (RHS) inputs must be of the same class (e.g. character, numeric, logical). Notice the use of the special value NA_real_ instead of just NA.

linelist <- linelist %>% 
  dplyr::mutate(age_years = case_when(
            age_unit == "years"  ~ age,       # if age is given in years
            age_unit == "months" ~ age/12,    # if age is given in months
            is.na(age_unit)      ~ age,       # if age unit is missing, assume years
            TRUE                 ~ NA_real_)) # Any other circumstance

ifelse() and if_else()


For simple uses of logical re-coding or new variable creation, you can use ifelse() or if_else(), though in most cases it is better to use case_when().

These commands are simplified versions of an if and else statement. The general syntax is ifelse(condition, value if condition evaluates to TRUE, value if condition evaluates to FALSE). If used in a mutate(), each row is evaluated. if_else() is a stricter version from dplyr that requires the TRUE and FALSE outputs to be of the same class (which matters, for example, when working with dates).

It can be tempting to string together many ifelse commands… resist this and use case_when() instead! It is much simpler, easier to read, and easier to debug.

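To illustrate (a hypothetical sketch, with made-up age cutoffs not from linelist), compare a chain of nested ifelse() commands with the equivalent case_when():

```r
library(dplyr)

age <- c(2, 10, 25, NA)

# hard to read: nested ifelse() commands
nested <- ifelse(age < 5, "child",
          ifelse(age < 18, "adolescent",
          ifelse(age >= 18, "adult", NA)))

# easier to read and debug: one case_when(), evaluated top-to-bottom
tidy <- case_when(
  age < 5   ~ "child",
  age < 18  ~ "adolescent",
  age >= 18 ~ "adult",
  TRUE      ~ NA_character_)
```

Both produce the same result, but the case_when() version makes each condition and its outcome visible at a glance.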

You can reference other columns with the ifelse() function within mutate():

Example of ifelse():

linelist <- linelist %>% 
  mutate(source_known = ifelse(!is.na(source), "known", "unknown"))

Example of if_else() (using dates): Note that if the ‘true’ value is a date, the ‘false’ value must also be compatible with a date, hence using the special value NA_real_ instead of just NA.

linelist <- linelist %>% 
  mutate(date_death = if_else(outcome == "Death", date_outcome, NA_real_))

Note: If you want to swap the value used in your code depending on other circumstances, consider using switch() from base R. For example if… TO DO. See the section on using switch() in the page on R interactive console.
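As a brief illustration of switch() itself (a hypothetical sketch; the disease names and threshold values below are invented for the example):

```r
# base R switch() returns the value whose name matches the input;
# a final unnamed value serves as the default
get_threshold <- function(disease) {
  switch(disease,
         "measles" = 10,   # hypothetical alert thresholds
         "cholera" = 5,
         2)                # default, if no name matches
}

get_threshold("cholera")
```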

Recoding using special dplyr functions


Using replace_na()

To change missing values (NA) to a specific character value, such as “Missing”, use the function replace_na() within mutate(). Note that this is used in the same manner as recode above - the name of the variable must be repeated within replace_na().

linelist <- linelist %>% 
  mutate(hospital = replace_na(hospital, "Missing"))

Using na_if()

Likewise you can quickly convert a specific character value to NA using na_if(). The command below is the opposite of the one above. It converts any values of “Missing” to NA.

linelist <- linelist %>% 
  mutate(hospital = na_if(hospital, "Missing"))

Using coalesce()

This dplyr function finds the first non-missing value at each position. So, you provide it with columns and for each row it will fill the value with the first non-missing value in the columns you provided.

For example, you might use coalesce() to create a “location” column from hypothetical columns “patient_residence” and “reporting_jurisdiction”, where you prioritize patient residence information, if it exists.

linelist <- linelist %>% 
  mutate(location = coalesce(patient_residence, reporting_jurisdiction))

TO DO lead(), lag() cumsum(), cummean(), cummin(), cummax(), cumany(), cumall(),
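As a minimal sketch of a few of these functions (using a hypothetical vector of daily case counts):

```r
library(dplyr)

daily_cases <- c(2, 5, 3, 8)

dplyr::lag(daily_cases)   # previous value: NA 2 5 3
dplyr::lead(daily_cases)  # next value:     5 3 8 NA
cumsum(daily_cases)       # running total:  2 7 10 18
cummax(daily_cases)       # running max:    2 5 5 8
```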

Recoding using cleaning dictionaries


CAUTION: This tab may follow from previous tabs.

## load cleaning rules and only keep columns in mll
mll_cleaning_rules <- import(here("dictionaries/mll_cleaning_rules.xlsx")) %>%
  filter(column %in% c(names(mll_raw), ".global"))

## define columns that are not cleaned
unchanged <- c(
  "epilink_relationship",
  "narratives",
  "epilink_relationship_detail"
)

mll_clean <- mll_raw %>%
  ## convert to tibble
  as_tibble() %>%
  ## clean columns using cleaning rules
  clean_data(
    wordlists = mll_cleaning_rules,
    protect = names(.) %in% unchanged
  )

Add to pipe chain


Here we add the described cleaning steps to the pipe chain.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
  
    # create column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 

  # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  ###################################################

    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 
    
    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_))

Filter rows


CAUTION: This tab may follow from previous tabs.

A typical early cleaning step is to filter the dataframe for specific rows using the dplyr verb filter(). Within filter(), give the logic that must be TRUE for a row in the dataset to be kept.

The tabs below show how to filter rows based on simple and complex logical conditions, how to filter/subset rows as a stand-alone command, and how to do so with base R.

A simple filter()


This simple example re-defines the dataframe linelist as itself, having filtered the rows to meet a logical condition. Only the rows where the logical statement within the parentheses is TRUE are kept.

In this case, the logical statement is !is.na(case_id), which is asking whether the value in the column case_id is not missing (NA). Thus, rows where case_id is not missing are kept.

Before the filter is applied, the number of rows in linelist is 6609.

linelist <- linelist %>% 
  filter(!is.na(case_id))  # keep only rows where case_id is not missing

After the filter is applied, the number of rows in linelist is 6605.

A complex filter()


A more complex example using filter():

Examine the data

Below is a simple one-line command to create a histogram of onset dates. See that a second smaller outbreak from 2012-2013 is also included in this dataset. For our analyses, we want to remove entries from this earlier outbreak.

hist(linelist$date_onset, breaks = 50)

How filters handle missing numeric and date values

Can we just filter by date_onset to rows after June 2013? Caution! Applying the code filter(date_onset > as.Date("2013-06-01")) would accidentally remove any rows in the later epidemic with a missing date of onset!

DANGER: Filtering to greater than (>) or less than (<) a date or number will also remove any rows with missing values (NA)! This is because a comparison involving NA evaluates to NA, and filter() keeps only rows where the condition evaluates to TRUE.
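A small demonstration of this behavior (with a hypothetical mini-dataframe): the row with a missing date is dropped, unless it is kept explicitly with is.na():

```r
library(dplyr)

# hypothetical mini-dataframe: one row has a missing onset date
df <- data.frame(date_onset = as.Date(c("2014-04-15", NA, "2013-01-10")))

# drops both the 2013 row AND the row with the missing date
filter(df, date_onset > as.Date("2013-06-01"))

# keeps the row with the missing date
filter(df, date_onset > as.Date("2013-06-01") | is.na(date_onset))
```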

Design the filter

Examine a cross-tabulation to make sure we exclude only the correct rows:

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
      useNA     = "always")                              # show missing values
##                                       YearOnset
## Hospital                               2012 2013 2014 2015 <NA>
##   Central Hospital                        0    0  351   99   15
##   Hospital A                            231   41    0    0   16
##   Hospital B                            228   40    0    0   21
##   Military Hospital                       0    0  679  204   33
##   Missing                                 0    0 1119  322   60
##   Other                                   0    0  685  173   44
##   Port Hospital                           7    2 1368  344   86
##   St. Mark's Maternity Hospital (SMMH)    0    0  330   93   14
##   <NA>                                    0    0    0    0    0

What other criteria can we filter on to remove the first outbreak from the dataset? We see that:

  • The first epidemic occurred at Hospital A and Hospital B, and there were also 9 cases at Port Hospital.
  • Hospitals A & B did not have cases in the second epidemic, but Port Hospital did.

We want to exclude:

  • The 586 rows with onset in 2012 and 2013 at either hospital A, B, or Port:
    • Exclude the 549 rows with onset in 2012 and 2013
    • Exclude the 37 rows from Hospitals A & B with missing onset dates
    • Do not exclude the 252 other rows with missing onset dates.

We start with a linelist of 6605 rows. Here is our filter statement:

linelist <- linelist %>% 
  # keep rows where onset is after 1 June 2013 OR where onset is missing and it was a hospital OTHER than Hospital A or B
  filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

nrow(linelist)
## [1] 6019

When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, the 9 Port Hospital cases from 2012 & 2013 are removed, and all other values are the same - just as we wanted.

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
      useNA     = "always")                              # show missing values
##                                       YearOnset
## Hospital                               2014 2015 <NA>
##   Central Hospital                      351   99   15
##   Military Hospital                     679  204   33
##   Missing                              1119  322   60
##   Other                                 685  173   44
##   Port Hospital                        1368  344   86
##   St. Mark's Maternity Hospital (SMMH)  330   93   14
##   <NA>                                    0    0    0

Multiple statements can be included within one filter command (separated by commas), or you can always pipe to a separate filter() command for clarity.

Note: some readers may notice that it would be easier to just filter by date_hospitalisation, because it is 100% complete. This is true, but date_onset is used here for the purposes of demonstrating a complex filter.

Filter as a stand-alone command


Filtering can also be done as a stand-alone command (not part of a pipe chain). Like other dplyr verbs, in this case the first argument must be the dataset itself.

# dataframe <- filter(dataframe, condition(s) for rows to keep)

linelist <- filter(linelist, !is.na(case_id))

You can also use base R to subset using square brackets which reflect the [rows, columns] that you want to retain.

# dataframe <- dataframe[row conditions, column conditions] (blank means keep all)

linelist <- linelist[!is.na(case_id), ]

TIP: Use bracket-subset syntax with View() to quickly review a few records.

Filtering to quickly review data


This base R syntax can be handy when you want to quickly view a subset of rows and columns. Use the base R View() command (note the capital “V”) around the [] subset you want to see. The result will appear as a dataframe in your RStudio viewer panel. For example, if I want to review onset and hospitalization dates of 3 specific cases:

View the linelist in the viewer panel:

View(linelist)

View specific data for three cases:

View(linelist[linelist$case_id %in% c("11f8ea", "76b97a", "47a5f5"), c("date_onset", "date_hospitalisation")])

Note: the above command can also be written with dplyr verbs filter() and select() as below:

View(linelist %>%
       filter(case_id %in% c("11f8ea", "76b97a", "47a5f5")) %>%
       select(date_onset, date_hospitalisation))

Add to pipe chain


# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
  
    
    # create column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 

    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years"  ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit)      ~ age,
          TRUE                 ~ NA_real_)) %>% 
    
  # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################
    filter(
          # keep only rows where case_id is not missing
          !is.na(case_id),  
          
          # also filter to keep only the second outbreak
          date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

Numeric categories


CAUTION: This tab may follow from previous tabs.

Special approaches for creating numeric categories

Common examples include age categories, groups of lab values, etc.

There are several ways to create categories of a numeric column such as age. Here we will discuss:

  • age_categories(), from the epikit package
  • cut(), from base R
  • using percentiles to break your numbers
  • natural break points… ? TO DO
  • case_when()

Sometimes, numeric variables will import as class “character”. This occurs if there are non-numeric characters in some of the values, for example an entry of “2 months” for age, or (depending on your R locale settings) if a comma is used as the decimal separator (e.g. “4,5” to mean four and one half years).
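For example (a hypothetical character vector, using base R), you can swap commas for periods before converting; entries that still contain non-numeric text become NA:

```r
# hypothetical ages imported as class "character"
age_raw <- c("4,5", "12", "2 months")

# replace comma decimal separators with periods, then convert to numeric;
# suppressWarnings() hides the warning about "2 months" becoming NA
age_num <- suppressWarnings(as.numeric(gsub(",", ".", age_raw)))
age_num
```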

For this example we will create an age_cat column using the age_years column.

#check the class of the linelist variable age
class(linelist$age_years)
## [1] "numeric"

age_categories()

With the epikit package, you can use the age_categories() function to easily categorize and label numeric columns (note: this can be applied to non-age numeric variables too). The output is an ordered factor.

The break values specified are included in the higher group; that is, groups are closed (inclusive) on the lower/left side and open on the upper/right side. As shown below, you can add 1 to each break value so that the stated upper age of each label is included in its group.

Other optional arguments:

  • lower = The lowest number you want considered (default is 0).
  • upper = The highest number you want considered.
  • by = The number of years between groups.
  • separator = Default is “-”. Character between ages in labels.
  • ceiling = Default FALSE. If TRUE, the highest break value is a ceiling and a category “XX+” is not included. Any values above highest break or upper (if defined) are categorized as NA.

See the function’s Help page for more details (enter ?age_categories in the R console).

library(epikit)

# Simple example
################
linelist <- linelist %>% 
  mutate(age_cat = age_categories(age_years,
                                  breakers = c(0, 5, 10, 15, 20, 30, 50, 70)))
# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA> 
##  1097  1177  1006   855  1108   639    46     0    91
# With ceiling set to TRUE
##########################
linelist <- linelist %>% 
  mutate(age_cat = age_categories(age_years, 
                                  breakers = c(0, 5, 10, 15, 20, 30, 50, 70),
                                  upper = max(linelist$age_years, na.rm=T),
                                  ceiling = TRUE)) # 70 is the ceiling
# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-49 50-70  <NA> 
##  1097  1177  1006   855  1108   639    46    91
# Include upper ends for the same categories
############################################
linelist <- linelist %>% 
  mutate(age_cat = age_categories(age_years, 
                                  upper = max(linelist$age_years, na.rm=T),
                                  breakers = c(0, 6, 11, 16, 21, 31, 51, 71, 76)))
# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-5  6-10 11-15 16-20 21-30 31-50 51-70 71-75   76+  <NA> 
##  1338  1160   976   809  1029   579    37     0     0    91

cut()


You can use the base R function cut(), which creates categories from a numeric variable. The differences from age_categories() are:

  • You do not need to install/load another package
  • You can specify whether groups are open/closed on the right/left
  • You must provide labels yourself (and ensure they are accurate to the groups)
  • If you want 0 included in the lowest group you must specify this

The basic syntax within cut() is to first provide the numeric variable to be cut (age_years), and then the breaks argument, which is a numeric vector (c()) of break points. Using cut(), the resulting column is a factor.

If used within mutate() (a dplyr verb) it is not necessary to specify the dataframe before the column name (e.g. linelist$age_years).

Simple cut() example


Create a new column of age categories (age_cat) by cutting the numeric age_years column at specified break points. The example below uses the same break points as the first age_categories() example, but note that because cut()'s break-inclusion behavior differs, values equal to a break point are assigned to different groups.

  • Specify numeric vector of break points c(0, 5, 10, 15, ...)
  • Default behavior for cut() is that lower break values are excluded from each category, and upper break values are included. This is the opposite behavior from the age_categories() function.
  • Include 0 in the lowest category by adding include.lowest = TRUE
  • Add a vector of customized labels using the labels = argument
  • Check your work with cross-tabulation of the numeric and category columns - be aware of missing values
linelist <- linelist %>% 
  mutate(age_cat = cut(age_years,                                       # numeric column
                        breaks = c(0, 5, 10, 15, 20, 30, 50, 70,        # break points...
                                   max(linelist$age_years, na.rm=T)),   # ... with dynamic last break as column max value
                        right = TRUE,                                   # upper break included, lower excluded (a,b] (the default)
                        include.lowest = TRUE,                          # 0 included in lowest category
                        labels = c("0-4", "5-9", "10-14", "15-19",      # manual labels - be careful!
                                   "20-29", "30-49", "50-69", "70+")))       

table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA> 
##  1338  1160   976   809  1029   579    37     0    91

cut() details


Below is a detailed description of the behavior of using cut() to make the age_cat column. Key points:

  • Inclusion/exclusion behavior of break points
  • Custom category labels
  • Handling missing values
  • Check your work!

The most simple command of cut() applied to age_years to make the new variable age_cat is below:

# Create new variable, by cutting the numeric age variable
# by default, the upper break is included and the lower break excluded from each category
linelist <- linelist %>% 
  mutate(age_cat = cut(age_years, breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100)))

# tabulate the number of observations per group
table(linelist$age_cat, useNA = "always")
## 
##    (0,5]   (5,10]  (10,15]  (15,20]  (20,30]  (30,50]  (50,70] (70,100]     <NA> 
##     1223     1160      976      809     1029      579       37        0      206

  • By default, the categorization occurs so that the right/upper side is “closed” and inclusive (and the left/lower side is “open” and exclusive). The default labels use the notation “(A, B]”, which means the group does not include A (the lower break value), but does include B (the upper break value). Reverse this behavior by providing the argument right = FALSE.

  • Thus, by default “0” values are excluded from the lowest group, and categorized as NA. “0” values could be infants coded as age 0. To change this add the argument include.lowest = TRUE. Then, any “0” values are included in the lowest group. The automatically-generated label for the lowest category will change from “(0,B]” to “[0,B]”, which signifies that 0 values are included.

  • Check your work!!! Verify that each age value was assigned to the correct category by cross-tabulating the numeric and category columns. Examine assignment of boundary values (e.g. 15, if neighboring categories are 10-15 and 15-20).

# Cross tabulation of the numeric and category columns. 
table("Numeric Values" = linelist$age_years,   # names specified in table for clarity.
      "Categories"     = linelist$age_cat,
      useNA = "always")                        # don't forget to examine NA values
##                     Categories
## Numeric Values       (0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70] (70,100] <NA>
##   0                      0      0       0       0       0       0       0        0  115
##   0.0833333333333333     1      0       0       0       0       0       0        0    0
##   0.166666666666667      1      0       0       0       0       0       0        0    0
##   0.333333333333333      3      0       0       0       0       0       0        0    0
##   0.416666666666667      1      0       0       0       0       0       0        0    0
##   0.5                    2      0       0       0       0       0       0        0    0
##   0.583333333333333      2      0       0       0       0       0       0        0    0
##   0.666666666666667      6      0       0       0       0       0       0        0    0
##   0.75                   1      0       0       0       0       0       0        0    0
##   0.833333333333333      2      0       0       0       0       0       0        0    0
##   1                    259      0       0       0       0       0       0        0    0
##   1.5                    3      0       0       0       0       0       0        0    0
##   2                    250      0       0       0       0       0       0        0    0
##   3                    229      0       0       0       0       0       0        0    0
##   4                    222      0       0       0       0       0       0        0    0
##   5                    241      0       0       0       0       0       0        0    0
##   6                      0    228       0       0       0       0       0        0    0
##   7                      0    231       0       0       0       0       0        0    0
##   8                      0    231       0       0       0       0       0        0    0
##   9                      0    246       0       0       0       0       0        0    0
##   10                     0    224       0       0       0       0       0        0    0
##   11                     0      0     208       0       0       0       0        0    0
##   12                     0      0     209       0       0       0       0        0    0
##   13                     0      0     191       0       0       0       0        0    0
##   14                     0      0     174       0       0       0       0        0    0
##   15                     0      0     194       0       0       0       0        0    0
##   16                     0      0       0     198       0       0       0        0    0
##   17                     0      0       0     179       0       0       0        0    0
##   18                     0      0       0     141       0       0       0        0    0
##   19                     0      0       0     143       0       0       0        0    0
##   20                     0      0       0     148       0       0       0        0    0
##   21                     0      0       0       0     137       0       0        0    0
##   22                     0      0       0       0     129       0       0        0    0
##   23                     0      0       0       0      99       0       0        0    0
##   24                     0      0       0       0     101       0       0        0    0
##   25                     0      0       0       0     108       0       0        0    0
##   26                     0      0       0       0     111       0       0        0    0
##   27                     0      0       0       0      95       0       0        0    0
##   28                     0      0       0       0      97       0       0        0    0
##   29                     0      0       0       0      83       0       0        0    0
##   30                     0      0       0       0      69       0       0        0    0
##   31                     0      0       0       0       0      57       0        0    0
##   32                     0      0       0       0       0      76       0        0    0
##   33                     0      0       0       0       0      71       0        0    0
##   34                     0      0       0       0       0      28       0        0    0
##   35                     0      0       0       0       0      43       0        0    0
##   36                     0      0       0       0       0      46       0        0    0
##   37                     0      0       0       0       0      44       0        0    0
##   38                     0      0       0       0       0      30       0        0    0
##   39                     0      0       0       0       0      20       0        0    0
##   40                     0      0       0       0       0      16       0        0    0
##   41                     0      0       0       0       0      24       0        0    0
##   42                     0      0       0       0       0      30       0        0    0
##   43                     0      0       0       0       0      15       0        0    0
##   44                     0      0       0       0       0      16       0        0    0
##   45                     0      0       0       0       0      17       0        0    0
##   46                     0      0       0       0       0      11       0        0    0
##   47                     0      0       0       0       0      11       0        0    0
##   48                     0      0       0       0       0      12       0        0    0
##   49                     0      0       0       0       0       3       0        0    0
##   50                     0      0       0       0       0       9       0        0    0
##   51                     0      0       0       0       0       0       4        0    0
##   52                     0      0       0       0       0       0       6        0    0
##   53                     0      0       0       0       0       0       3        0    0
##   54                     0      0       0       0       0       0       4        0    0
##   55                     0      0       0       0       0       0       4        0    0
##   56                     0      0       0       0       0       0       6        0    0
##   57                     0      0       0       0       0       0       2        0    0
##   58                     0      0       0       0       0       0       2        0    0
##   59                     0      0       0       0       0       0       1        0    0
##   61                     0      0       0       0       0       0       1        0    0
##   63                     0      0       0       0       0       0       1        0    0
##   65                     0      0       0       0       0       0       1        0    0
##   66                     0      0       0       0       0       0       1        0    0
##   67                     0      0       0       0       0       0       1        0    0
##   <NA>                   0      0       0       0       0       0       0        0   91

Read more about cut() in its Help page by entering ?cut in the R console.

Reversing break inclusion behavior in cut()

Lower break values will be included in each category (and upper break values excluded) if the argument right = is included and set to FALSE. This is applied below - note how the values have shifted among the categories.

NOTE: If you include the include.lowest = TRUE argument and also set right = FALSE, the include.lowest will now apply to the highest break point value and category, not the lowest.

linelist <- linelist %>% 
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),     # same breaks
                          right = FALSE,                                     # include each *lower* break point            
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")))  # now the labels must change

table(linelist$age_cat, useNA = "always")
## 
##    0-4    5-9  10-14  15-19  20-29  30-49  50-69 70-100   <NA> 
##   1097   1177   1006    855   1108    639     46      0     91

Re-labeling NA values with cut()

Because cut() does not automatically label NA values, you may want to assign a label such as “Missing”. This requires a few extra steps, because cut() automatically classifies the new column age_cat as a Factor (a rigid column class with specific value labels).

First, convert age_cat from Factor to Character class, so you have the flexibility to add new character values (e.g. “Missing”). Otherwise you will encounter an error. Then, use the tidyr function replace_na() to replace NA values with a character value like “Missing”. These steps can be combined into one, as shown below.

Note that Missing has been added, but the order of the categories is now wrong (alphabetical).

linelist <- linelist %>% 
  
  # cut() creates age_cat, automatically of class Factor      
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),          
                          right = FALSE,                                                      
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")),
         
         # convert to class Character, and replace NA with "Missing"
         age_cat = replace_na(as.character(age_cat), "Missing"))


table(linelist$age_cat, useNA = "always")
## 
##     0-4   10-14   15-19   20-29   30-49     5-9   50-69 Missing    <NA> 
##    1097    1006     855    1108     639    1177      46      91       0

To fix this, re-convert age_cat to a factor, and define the order of the levels correctly.

linelist <- linelist %>% 
  
  # cut() creates age_cat, automatically of class Factor      
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),          
                          right = FALSE,                                                      
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")),
         
         # convert to class Character, and replace NA with "Missing"
         age_cat = replace_na(as.character(age_cat), "Missing"),
         
         # re-classify age_cat as Factor, with correct level order and new "Missing" level
         age_cat = factor(age_cat, levels = c("0-4", "5-9", "10-14", "15-19", "20-29",
                                              "30-49", "50-69", "70-100", "Missing")))    
  

table(linelist$age_cat, useNA = "always")
## 
##     0-4     5-9   10-14   15-19   20-29   30-49   50-69  70-100 Missing    <NA> 
##    1097    1177    1006     855    1108     639      46       0      91       0

If you want a fast way to make breaks and labels, you can use seq() and paste0() as below (adjust to your specific situation).

# Make break points from 0 to 90 by 5
age_seq = seq(from = 0, to = 90, by = 5)
age_seq

# Make labels for the above categories, assuming default cut() settings
# (there must be one fewer label than break points)
age_labels = paste0(head(age_seq, -1) + 1, "-", tail(age_seq, -1))
age_labels

# check that there is one more break point than labels
length(age_seq) == length(age_labels) + 1

# # Use them in the cut() command
# cut(linelist$age, breaks = age_seq, labels = age_labels)

case_when()

The dplyr function case_when() can also be used to create numeric categories.

  • Allows explicit setting of break point inclusion/exclusion
  • Allows designation of label for NA values in one step
  • More complicated code, arguably more prone to error
  • Allows more flexibility to include other variables in the logic

If using case_when(), please review the in-depth page on it, as the logic and order of assignment are important to understand to avoid errors.

CAUTION: In case_when() all right-hand side values must be of the same class. Thus, if your categories are character values (e.g. “20-30 years”) then any designated outcome for NA age values must also be character (“Missing”, or the special NA_character_ instead of NA).
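A minimal sketch of this class requirement (the vector x and the labels here are hypothetical):

```r
library(dplyr)

x <- c(10, NA)

# all right-hand sides are character, including the missing outcome,
# written as NA_character_ rather than plain NA
case_when(
  x < 20   ~ "under 20",
  is.na(x) ~ NA_character_
)
```

Depending on your dplyr version, supplying a plain logical NA alongside character outcomes may produce an error about mismatched classes.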

You will need to designate the column as a factor (by wrapping case_when() in the function factor()) and provide the ordering of the factor levels using the levels = argument after the close of the case_when() function. When using cut(), the factor and ordering of levels is done automatically.

linelist <- linelist %>% 
  mutate(age_cat = factor(case_when(
          # provide the case_when logic and outcomes
          age_years >= 0 & age_years < 5     ~ "0-4",          # logic by age_year value
          age_years >= 5 & age_years < 10    ~ "5-9",
          age_years >= 10 & age_years < 15   ~ "10-14",
          age_years >= 15 & age_years < 20   ~ "15-19",
          age_years >= 20 & age_years < 30   ~ "20-29",
          age_years >= 30 & age_years < 50   ~ "30-49",
          age_years >= 50 & age_years < 70   ~ "50-69",
          age_years >= 70 & age_years <= 100 ~ "70-100",
          is.na(age_years)                   ~ "Missing",  # if age_years is missing
          TRUE                               ~ "Check value"   # catch-all alarm to trigger review
          ), levels = c("0-4","5-9", "10-14", "15-19", "20-29", "30-49", "50-69", "70-100", "Missing", "Check value"))
         )


table(linelist$age_cat, useNA = "always")
## 
##         0-4         5-9       10-14       15-19       20-29       30-49       50-69      70-100     Missing Check value        <NA> 
##        1097        1177        1006         855        1108         639          46           0          91           0           0

Add to pipe chain

Below, code to create two categorical age columns is added to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
        select(-c(row_num, merged_header, x28)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
  
    
    # create column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 

    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_)) %>% 
    
    filter(
          # keep only rows where case_id is not missing
          !is.na(case_id),  
          
          # also filter to keep only the second outbreak
          date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B"))) %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################   
    mutate(
          # age categories: custom
          age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
        
          # age categories: 0 to 85 by 5s
          age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5)))

rowwise() in dplyr

To perform a calculation within each row (for example, summing values across several columns of that row), use rowwise() from dplyr. See the rowwise vignette for details: https://cran.r-project.org/web/packages/dplyr/vignettes/rowwise.html

Below, rowwise() is used to create num_symptoms, counting how many of the five symptom columns equal “yes” in each row:

linelist <- linelist %>%
  rowwise() %>%
  mutate(num_symptoms = sum(c(fever, chills, cough, aches, vomit) == "yes")) %>% 
  ungroup()    # return to a normal (non-rowwise) dataframe

Transforming multiple variables at once

CAUTION: This tab may follow from previous tabs.

A transformation can be applied to multiple variables at once using the across() function from the package dplyr (contained within tidyverse package).

across() can be used with any dplyr verb, but most commonly with mutate(), filter(), or summarise(). Here are some examples to get started.

For example, to change all columns to character class:

#to change all columns to character class
linelist <- linelist %>% 
  mutate(across(everything(), as.character))

Change only numeric columns
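A sketch using the where() selection helper within across() to target only columns that are currently numeric (the small example dataframe below is hypothetical):

```r
library(dplyr)

# a small example dataframe
df <- tibble::tibble(
  id  = c("a", "b"),
  age = c(10, 25),
  wt  = c(30.5, 70.2)
)

# convert only the numeric columns to character class
df <- df %>% 
  mutate(across(where(is.numeric), as.character))

sapply(df, class)   # all three columns are now "character"
```

The same pattern applies to linelist, e.g. mutate(across(where(is.numeric), as.character)).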

Here are a few online resources on using across(): Hadley Wickham’s thoughts/rationale

Deduplication

CAUTION: This tab may follow from previous tabs.

The package dplyr offers the distinct() function to reduce the dataframe to only unique rows, removing duplicates.
In this case we want to remove rows that are complete duplicates, so we simply add distinct() with no arguments.

For more complex deduplications see the page on deduplicating.

We begin with 6019 rows in linelist.

linelist <- linelist %>% 
  distinct()

After deduplication there are 5889 rows.

Below, the distinct() command is added to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
        select(-c(row_num, merged_header, x28)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
  
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # create column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 

    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_)) %>% 
    
    filter(
          # keep only rows where case_id is not missing
          !is.na(case_id),  
          
          # also filter to keep only the second outbreak
          date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B"))) %>% 
  
    mutate(
          # age categories: custom
          age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
        
          # age categories: 0 to 85 by 5s
          age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5))) %>% 
  
    distinct()

Characters/strings

Overview

This tab demonstrates use of the stringr package to evaluate and handle character values (“strings”).

  1. Evaluate and subset/extract - str_length(), str_sub(), word()
  2. Combine, order, arrange - str_c(), str_glue(), str_order()
  3. Modify and replace - str_sub(), str_replace_all()
  4. Adjust length - str_pad(), str_trunc(), str_wrap()
  5. Change case - str_to_upper(), str_to_title(), str_to_lower(), str_to_sentence()
  6. Search for patterns - str_detect(), str_subset(), str_match()

For ease of display, most examples are shown acting on a short defined character vector; however, they can easily be applied/adapted to a column within a dataset.

Much of this page is adapted from this online vignette

Preparation

Install or load the stringr package.

# install or load the stringr package
pacman::p_load(stringr,   # many functions for handling strings
               tidyverse,  # for optional data manipulation
               tools      # alternative for converting to title case
               )

A reference sheet for stringr functions can be found here

Evaluate and subset

Evaluate the length of a string

str_length("abc")
## [1] 3

Alternatively, use nchar() from base R
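For example:

```r
# base R equivalent of str_length()
nchar("abc")   # returns 3
```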

Subset/extract string by position

Use str_sub() to return only a part of a string. The function takes three main arguments:

  1. the character vector(s)
  2. start position
  3. end position

A few notes on position numbers:

  • If a position number is positive, the position is counted starting from the left end of the string.
  • If a position number is negative, it is counted starting from the right end of the string.
  • Position numbers are inclusive.
  • Positions extending beyond the string will be truncated (removed).

Below are some examples applied to the string “pneumonia”:

# third from left
str_sub("pneumonia", 3, 3)
## [1] "e"
# 0 is not present
str_sub("pneumonia", 0, 0)
## [1] ""
# 6th from left, to the first from right
str_sub("pneumonia", 6, -1)
## [1] "onia"
# fifth from right, to the first from right
str_sub("pneumonia", -5, -1)
## [1] "monia"
# positions outside the string
str_sub("pneumonia", 4, 15)
## [1] "umonia"

Subset string by word position

To extract the nth ‘word’, use word(), also from stringr. Provide the string(s), then the first word position to extract, and the last word position to extract.

By default, the separator between ‘words’ is assumed to be a space, unless otherwise indicated with sep = (e.g. sep = "_" when words are separated by underscores).

# strings to evaluate
chief_complaints <- c("I just got out of the hospital 2 days ago, but still can barely breathe.",
                      "My stomach hurts",
                      "Severe ear pain")

# extract 1st-3rd words of each string
word(chief_complaints, start = 1, end = 3, sep = " ")
## [1] "I just got"       "My stomach hurts" "Severe ear pain"

Combine, order, and arrange

This section covers using str_c(), str_glue(), and str_order() to combine, arrange, and paste together strings.

Combine strings


To combine multiple strings into one string, you can use str_c() (similar to base R’s paste0()).

str_c("String1", "String2", "String3")
## [1] "String1String2String3"

The argument sep = inserts characters between each of the input vectors (e.g. a comma or newline "\n").

str_c("String1", "String2", "String3", sep = ", ")
## [1] "String1, String2, String3"

The argument collapse = is relevant if producing multiple elements. The example below shows the combination of first and last names. The sep value goes between each first and last name, while the collapse value goes between the people.

first_names <- c("abdul", "fahruk", "janice") 
last_names  <- c("hussein", "akinleye", "musa")

# sep is between the respective strings, while collapse is between the elements produced
str_c(first_names, last_names, sep = " ", collapse = ";  ")
## [1] "abdul hussein;  fahruk akinleye;  janice musa"
# For newlines to print correctly, the phrase may need to be wrapped in cat()
cat(str_c(first_names, last_names, sep = " ", collapse = ";\n"))
## abdul hussein;
## fahruk akinleye;
## janice musa

Glueing strings and other values

str_glue()

You can also combine strings and other pre-defined values and characters with str_glue(). This is a very useful function for creating dynamic plot captions, as demonstrated below.

  • All content goes between quotation marks ("").
  • Any dynamic code or calls to pre-defined objects must be within curly brackets {}. There can be many curly brackets.
  • Within the outer quotation marks, use single quotes if necessary (e.g. when providing date format)
  • You can provide newlines (\n), use format() to display dates, use Sys.Date() to display the current date.
  • If using the %>% pipe operator, ensure the tidyverse package is loaded.

A simple example:

str_glue("The linelist is current to {format(Sys.Date(), '%d %b %Y')} and includes {nrow(linelist)} cases.")
## The linelist is current to 31 Jan 2021 and includes 5889 cases.

An alternative format is to use placeholders within the brackets and define the code in separate arguments at the end of the str_glue() function, as below. This can improve code readability if the expressions are long.

str_glue("Data source is the confirmed case linelist as of {current_date}.\nThe last case was reported hospitalized on {last_hospital}.\n{n_missing_onset} cases are missing date of onset and not shown",
         current_date = format(Sys.Date(), '%d %b %Y'),
         last_hospital = format(as.Date(max(linelist$date_hospitalisation, na.rm=T)), '%d %b %Y'),
         n_missing_onset = nrow(linelist %>% filter(is.na(date_onset)))
         )
## Data source is the confirmed case linelist as of 31 Jan 2021.
## The last case was reported hospitalized on 30 Apr 2015.
## 248 cases are missing date of onset and not shown

Sometimes it is useful to pull data from a dataframe and paste it together in sequence. Below is an example using this dataset to make a summary output of jurisdictions and their new and total cases:

DT::datatable(case_table, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Use str_c() with the dataframe and column names (as in the example above with first & last names). Provide sep and collapse arguments.

str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = ";  ")
## [1] "Zone 1 = 3;  Zone 2 = 0;  Zone 3 = 7;  Zone 4 = 0;  Zone 5 = 15"

We add the text “New Cases:” to the beginning of the summary by wrapping it in a separate str_c(). If “New Cases:” were added within the original str_c(), it would appear multiple times.

str_c("New Cases: ", str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = ";  "))
## [1] "New Cases: Zone 1 = 3;  Zone 2 = 0;  Zone 3 = 7;  Zone 4 = 0;  Zone 5 = 15"

You can achieve a similar result with str_glue(), with newlines added automatically:

str_glue("{case_table$zone}: {case_table$new_cases} new cases ({case_table$total_cases} total cases)")
## Zone 1: 3 new cases (40 total cases)
## Zone 2: 0 new cases (4 total cases)
## Zone 3: 7 new cases (25 total cases)
## Zone 4: 0 new cases (10 total cases)
## Zone 5: 15 new cases (103 total cases)

To use str_glue() but have more control (e.g. to use double newlines), wrap it within str_c() and adjust the collapse value. You may need to print using cat() to correctly print the newlines.

case_summary <- str_c(str_glue("{case_table$zone}: {case_table$new_cases} new cases ({case_table$total_cases} total cases)"), collapse = "\n\n")

cat(case_summary) # print
## Zone 1: 3 new cases (40 total cases)
## 
## Zone 2: 0 new cases (4 total cases)
## 
## Zone 3: 7 new cases (25 total cases)
## 
## Zone 4: 0 new cases (10 total cases)
## 
## Zone 5: 15 new cases (103 total cases)

Sorting

Several strings can be sorted by alphabetical order. str_order() returns the order, while str_sort() returns the strings in that order.

# strings
health_zones <- c("Alba", "Takota", "Delta")

# return the alphabetical order
str_order(health_zones)
## [1] 1 3 2
# return the strings in alphabetical order
str_sort(health_zones)
## [1] "Alba"   "Delta"  "Takota"

To use a different alphabet, add the argument locale =. See the full list of locales by entering stringi::stri_locale_list() in the R console.

base R functions

It is common to see the base R functions paste() and paste0(), which concatenate vectors after converting all parts to character. They act similarly to str_c() but the syntax differs - the parts (either text or code/pre-defined objects) are separated by commas, for example: paste("Regional hospital needs", n_beds, "beds and", n_masks, "masks."). The sep and collapse arguments can be adjusted. By default sep is a space, unless using paste0(), where there is no space between parts.
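A runnable sketch (n_beds and n_masks are hypothetical objects):

```r
n_beds  <- 10
n_masks <- 20

# paste() separates the parts with a space by default
paste("Regional hospital needs", n_beds, "beds and", n_masks, "masks.")
# "Regional hospital needs 10 beds and 20 masks."

# paste0() uses no separator between parts
paste0("ID-", 1:3)
# "ID-1" "ID-2" "ID-3"
```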

Modify and replace

Replace specific character positions

str_sub() paired with the assignment operator (<-) can be used to modify a part of a string:

word <- "pneumonia"

# convert the third and fourth characters to X 
str_sub(word, 3, 4) <- "XX"

word
## [1] "pnXXmonia"

An example applied to multiple strings (e.g. a column). Note the expansion in length of “HIV”.

words <- c("pneumonia", "tubercolosis", "HIV")

# convert the third and fourth characters to X 
str_sub(words, 3, 4) <- "XX"

words
## [1] "pnXXmonia"    "tuXXrcolosis" "HIXX"

Replace patterns

Use str_replace_all() as a “find and replace” tool. First, provide the strings to be evaluated, then the pattern to be replaced, and then the replacement value. The example below replaces all instances of “dead” with “deceased”. Note, this IS case sensitive.

outcome <- c("Karl: dead",
            "Samantha: dead",
            "Marco: not dead")

str_replace_all(outcome, "dead", "deceased")
## [1] "Karl: deceased"      "Samantha: deceased"  "Marco: not deceased"

To convert NA values into a string (by default "NA"), use str_replace_na(). The function str_replace() replaces only the first instance of the pattern within each evaluated string.
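A short sketch contrasting these functions (the strings are hypothetical):

```r
library(stringr)

notes <- c("dead dead", "not dead")

# str_replace() changes only the first match in each string
str_replace(notes, "dead", "deceased")
# "deceased dead" "not deceased"

# str_replace_all() changes every match
str_replace_all(notes, "dead", "deceased")
# "deceased deceased" "not deceased"

# str_replace_na() converts NA values to a string
str_replace_na(c("dead", NA), replacement = "unknown")
# "dead" "unknown"
```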

Adjust length

Increase minimum length (pad)

Use str_pad() to add characters to a string, to a minimum length.

By default spaces are added, but you can also pad with other characters using the pad = argument.

# ICD codes of differing length
ICD_codes <- c("R10.13",
               "R10.819",
               "R17")

# ICD codes padded to 7 characters on the right side
str_pad(ICD_codes, 7, "right")
## [1] "R10.13 " "R10.819" "R17    "
# Pad with periods instead of spaces
str_pad(ICD_codes, 7, "right", pad = ".")
## [1] "R10.13." "R10.819" "R17...."

For example, to pad numbers with leading zeros (such as for hours or minutes), you can pad the number to minimum length of 2 with pad = "0".

# Add leading zeros to two digits (e.g. for times minutes/hours)
str_pad("4", 2, pad = "0") 
## [1] "04"
# example using a numeric column named "hours"
# hours <- str_pad(hours, 2, pad = "0")

Truncate/shorten

str_trunc() sets a maximum length for each string. If a string exceeds this length, it is truncated (shortened) and an ellipsis (…) is included to indicate that the string was previously longer. Note that the ellipsis is counted in the length. The ellipsis characters can be changed with the argument ellipsis =. The optional side = argument specifies where the ellipsis will appear within the truncated string (“left”, “right”, or “center”).

original <- "Symptom onset on 4/3/2020 with vomiting"
str_trunc(original, 10, "center")
## [1] "Symp...ing"

To ensure each value is the same length

Use str_trunc() to set a maximum length, and then use str_pad() to expand the very short strings to that truncated length. In the example below, 6 is set as the maximum length (one value is truncated), and then a very short value is padded to achieve length of 6.

# ICD codes of differing length
ICD_codes   <- c("R10.13",
                 "R10.819",
                 "R17")

# truncate to maximum length of 6
ICD_codes_2 <- str_trunc(ICD_codes, 6)
ICD_codes_2
## [1] "R10.13" "R10..." "R17"
# expand to minimum length of 6
ICD_codes_3 <- str_pad(ICD_codes_2, 6, "right")
ICD_codes_3
## [1] "R10.13" "R10..." "R17   "

Remove leading/trailing whitespace

Use str_trim() to remove spaces, newlines (\n) or tabs (\t) on sides of a string input.
Add "right", "left", or "both" to the command to specify which side to trim (e.g. str_trim(x, "right")).

# ID numbers with excess spaces on right
IDs <- c("provA_1852  ", # two excess spaces
         "provA_2345",   # zero excess spaces
         "provA_9460 ")  # one excess space

# IDs trimmed to remove excess spaces on right side only
str_trim(IDs, "right")
## [1] "provA_1852" "provA_2345" "provA_9460"

Remove repeated whitespace within strings

Use str_squish() to remove repeated spaces that appear inside a string. For example, to convert double spaces into single spaces. It also removes spaces, newlines, or tabs on the outside of the string like str_trim().

# original contains excess spaces within string
str_squish("  Pt requires   IV saline\n") 
## [1] "Pt requires IV saline"

Enter ?str_trim, ?str_pad in your R console to see further details.

Wrap lines into paragraphs

Use str_wrap() to wrap a long unstructured text into a structured paragraph with fixed line length. Provide the ideal character length for each line, and it applies an algorithm to insert newlines (\n) within the paragraph, as seen in the example below.

pt_course <- "Symptom onset 1/4/2020 vomiting chills fever. Pt saw traditional healer in home village on 2/4/2020. On 5/4/2020 pt symptoms worsened and was admitted to Lumta clinic. Sample was taken and pt was transported to regional hospital on 6/4/2020. Pt died at regional hospital on 7/4/2020."

str_wrap(pt_course, 40)
## [1] "Symptom onset 1/4/2020 vomiting chills\nfever. Pt saw traditional healer in\nhome village on 2/4/2020. On 5/4/2020\npt symptoms worsened and was admitted\nto Lumta clinic. Sample was taken and pt\nwas transported to regional hospital on\n6/4/2020. Pt died at regional hospital\non 7/4/2020."

The base function cat() can be wrapped around the above command in order to print the output, displaying the new lines added.

cat(str_wrap(pt_course, 40))
## Symptom onset 1/4/2020 vomiting chills
## fever. Pt saw traditional healer in
## home village on 2/4/2020. On 5/4/2020
## pt symptoms worsened and was admitted
## to Lumta clinic. Sample was taken and pt
## was transported to regional hospital on
## 6/4/2020. Pt died at regional hospital
## on 7/4/2020.

Change case

Often one must alter the case/capitalization of a string value, for example the names of jurisdictions. Use str_to_upper(), str_to_lower(), and str_to_title(), as shown below:

str_to_upper("California")
## [1] "CALIFORNIA"
str_to_lower("California")
## [1] "california"

Using base R, the above can also be achieved with toupper() and tolower().

Title case

Transforming the string so each word is capitalized can be achieved with str_to_title():

str_to_title("go to the US state of california ")
## [1] "Go To The Us State Of California "

Use toTitleCase() from the tools package to achieve more nuanced capitalization (words like “to”, “the”, and “of” are not capitalized).

tools::toTitleCase("This is the US state of california")
## [1] "This is the US State of California"

You can also use str_to_sentence(), which capitalizes only the first letter of the string.

str_to_sentence("the patient must be transported")
## [1] "The patient must be transported"

Patterns

Many stringr functions work to detect, locate, extract, match, replace, and split based on a specified pattern.

Detect presence/absence of a pattern

Use str_detect() as below. Note that by default the search is case sensitive!

str_detect("primary school teacher", "teach")
## [1] TRUE

The argument negate = can be included and set to TRUE if you want to know if the pattern is NOT present.

str_detect("primary school teacher", "teach", negate = TRUE)
## [1] FALSE

To ignore case/capitalization, wrap the pattern within regex() and within regex() add the argument ignore_case = T.

str_detect("Teacher", regex("teach", ignore_case = T))
## [1] TRUE

When str_detect() is applied to a character vector/column, it will return a TRUE/FALSE for each of the values in the vector.

# a vector/column of occupations 
occupations <- c("field laborer",
                 "university professor",
                 "primary school teacher & tutor",
                 "tutor",
                 "nurse at regional hospital",
                 "lineworker at Amberdeen Fish Factory",
                 "physican",
                 "cardiologist",
                 "office worker",
                 "food service")

# Detect presence of pattern "teach" in each string - output is vector of TRUE/FALSE
str_detect(occupations, "teach")
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

If you need to count these, apply sum() to the output. This counts the number of TRUE values.

sum(str_detect(occupations, "teach"))
## [1] 1

To search for any of multiple terms, separate them with OR bars (|) within the pattern, as shown below:

sum(str_detect(occupations, "teach|professor|tutor"))
## [1] 3

If you need to make a long list of search terms, you can combine them using str_c() with sep = "|", save this as a character object, and reference it later more succinctly. The example below includes possible occupation search terms for front-line medical providers.

# search terms
occupation_med_frontline <- str_c("medical", "medicine", "hcw", "healthcare", "home care", "home health",
                                "surgeon", "doctor", "doc", "physician", "surgery", "peds", "pediatrician",
                               "intensivist", "cardiologist", "coroner", "nurse", "nursing", "rn", "lpn",
                               "cna", "pa", "physician assistant", "mental health",
                               "emergency department technician", "resp therapist", "respiratory",
                                "phlebotomist", "pharmacy", "pharmacist", "hospital", "snf", "rehabilitation",
                               "rehab", "activity", "elderly", "subacute", "sub acute",
                                "clinic", "post acute", "therapist", "extended care",
                                "dental", "dential", "dentist", sep = "|")

occupation_med_frontline
## [1] "medical|medicine|hcw|healthcare|home care|home health|surgeon|doctor|doc|physician|surgery|peds|pediatrician|intensivist|cardiologist|coroner|nurse|nursing|rn|lpn|cna|pa|physician assistant|mental health|emergency department technician|resp therapist|respiratory|phlebotomist|pharmacy|pharmacist|hospital|snf|rehabilitation|rehab|activity|elderly|subacute|sub acute|clinic|post acute|therapist|extended care|dental|dential|dentist"

This command returns the number of occupations which contain any one of the search terms for front-line medical providers (occupation_med_frontline):

sum(str_detect(occupations, occupation_med_frontline))
## [1] 2

Base R string search functions

The base function grepl() works similarly to str_detect(), in that it searches for matches to a pattern and returns a logical vector. The basic syntax is grepl(pattern, strings_to_search, ignore.case = FALSE, ...). One advantage is that the ignore.case argument is easier to write (there is no need to use the regex() function).

Likewise, the base functions sub() and gsub() act similarly to str_replace(). Their basic syntax is: gsub(pattern, replacement, strings_to_search, ignore.case = FALSE). sub() will replace the first instance of the pattern, whereas gsub() will replace all instances of the pattern.
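A quick illustration of these base functions (the example strings are hypothetical):

```r
notes <- c("Teacher at summer school", "Office clerk", "school nurse")

# grepl() returns TRUE/FALSE for each string; ignore.case is a simple argument
grepl("teacher", notes, ignore.case = TRUE)
## [1]  TRUE FALSE FALSE

# gsub() replaces every instance of the pattern in each string
gsub("school", "academy", notes)
## [1] "Teacher at summer academy" "Office clerk"              "academy nurse"
```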

Detect patterns in conditional logic

Within case_when()

str_detect() is often used within case_when() (from dplyr). Let’s say the occupations are a column in the linelist called occupations. The mutate() below creates a new column called is_educator by using conditional logic via case_when(). See the page on data cleaning to learn more about case_when().

df <- df %>% 
  mutate(is_educator = case_when(
    # term search within occupation, not case sensitive
    str_detect(occupations,
               regex("teach|prof|tutor|university",
                     ignore_case = TRUE))              ~ "Educator",
    # all others
    TRUE                                               ~ "Not an educator"))

As a reminder, it may be important to add exclusion criteria to the conditional logic. Below, negate = TRUE is used within the second str_detect() so that rows containing an exclusion term are NOT classified as educators:

df <- df %>% 
  # value in new column is_educator is based on conditional logic
  mutate(is_educator = case_when(
    
    # occupation column must meet 2 criteria to be assigned "Educator":
    # it must have a search term AND NOT any exclusion term
    
    # Must have a search term AND
    str_detect(occupations,
               regex("teach|prof|tutor|university", ignore_case = T)) &              
    # Must NOT have an exclusion term
    str_detect(occupations,
               regex("admin", ignore_case = T),
               negate = T)                          ~ "Educator",
    
    # All rows not meeting above criteria
    TRUE                                            ~ "Not an educator"))

Locate pattern position

To locate the first position of a pattern, use str_locate(). It outputs a start and end position.

str_locate("I wish", "sh")
##      start end
## [1,]     5   6

Like other str functions, there is an "_all" version (str_locate_all()) which will return the positions of all instances of the pattern within each string. This outputs as a list.

phrases <- c("I wish", "I hope", "he hopes", "He hopes")

str_locate(phrases, "h" )     # position of *first* instance of the pattern
##      start end
## [1,]     6   6
## [2,]     3   3
## [3,]     1   1
## [4,]     4   4
str_locate_all(phrases, "h" ) # position of *every* instance of the pattern
## [[1]]
##      start end
## [1,]     6   6
## 
## [[2]]
##      start end
## [1,]     3   3
## 
## [[3]]
##      start end
## [1,]     1   1
## [2,]     4   4
## 
## [[4]]
##      start end
## [1,]     4   4

Extract a match

str_extract_all() returns the matching patterns themselves, which is most useful when you have offered several patterns via “OR” conditions. For example, looking in the string vector of occupations (see previous tab) for either “teach”, “prof”, or “tutor”.

str_extract_all() returns a list which contains all matches for each evaluated string. See below how occupation 3 has two pattern matches within it.

str_extract_all(occupations, "teach|prof|tutor")
## [[1]]
## character(0)
## 
## [[2]]
## [1] "prof"
## 
## [[3]]
## [1] "teach" "tutor"
## 
## [[4]]
## [1] "tutor"
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)
## 
## [[7]]
## character(0)
## 
## [[8]]
## character(0)
## 
## [[9]]
## character(0)
## 
## [[10]]
## character(0)

str_extract() extracts only the first match in each evaluated string, producing a character vector with one element for each evaluated string. It returns NA where there was no match. The NAs can be removed by wrapping the returned vector with na.exclude(). Note how the second of occupation 3’s matches is not shown.

str_extract(occupations, "teach|prof|tutor")
##  [1] NA      "prof"  "teach" "tutor" NA      NA      NA      NA      NA      NA

Subset and Count


Aligned functions include str_subset() and str_count().

str_subset() returns the actual values which contained the pattern:

str_subset(occupations, "teach|prof|tutor")
## [1] "university professor"           "primary school teacher & tutor" "tutor"

str_count() returns a vector of numbers: the number of times a search term appears in each evaluated value.

str_count(occupations, regex("teach|prof|tutor", ignore_case = TRUE))
##  [1] 0 1 2 1 0 0 0 0 0 0

Splitting

To split a string based on a pattern, use str_split(). It evaluates the strings and returns a list of character vectors consisting of the newly-split values.

The simple example below evaluates one string and, because simplify = TRUE is used, produces a character matrix with one row and three values:

str_split("jaundice, fever, chills", ",", simplify = T)
##      [,1]       [,2]     [,3]     
## [1,] "jaundice" " fever" " chills"

You can assign this as a named object, and access the nth symptom:

pt1_symptoms <- str_split("jaundice, fever, chills", ",", simplify = T)

pt1_symptoms[2]
## [1] " fever"

If multiple strings are evaluated, there will be more than one element in the returned list.

symptoms <- c("jaundice, fever, chills",     # patient 1
              "chills, aches, pains",        # patient 2 
              "fever",                       # patient 3
              "vomiting, diarrhoea",         # patient 4
              "bleeding from gums, fever",   # patient 5
              "rapid pulse, headache")       # patient 6

str_split(symptoms, ",")                     # split each patient's symptoms
## [[1]]
## [1] "jaundice" " fever"   " chills" 
## 
## [[2]]
## [1] "chills" " aches" " pains"
## 
## [[3]]
## [1] "fever"
## 
## [[4]]
## [1] "vomiting"   " diarrhoea"
## 
## [[5]]
## [1] "bleeding from gums" " fever"            
## 
## [[6]]
## [1] "rapid pulse" " headache"

To access a specific symptom you can use syntax like this: the_split_return_object[[2]][1], which would access the first symptom from the second evaluated string (“chills”). See the R basics page for more detail on accessing elements.
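As a small illustration of this syntax (the two strings below are hypothetical):

```r
library(stringr)

split_sym <- str_split(c("jaundice, fever", "chills, aches"), ",")

split_sym[[2]][1]   # first element of the second evaluated string
## [1] "chills"
```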

To return a “character matrix” instead, which may be useful if creating dataframe columns, set the argument simplify = TRUE as shown below:

str_split(symptoms, ",", simplify = T)
##      [,1]                 [,2]         [,3]     
## [1,] "jaundice"           " fever"     " chills"
## [2,] "chills"             " aches"     " pains" 
## [3,] "fever"              ""           ""       
## [4,] "vomiting"           " diarrhoea" ""       
## [5,] "bleeding from gums" " fever"     ""       
## [6,] "rapid pulse"        " headache"  ""

You can also adjust the number of splits to create with the n = argument. For example, this restricts the number of pieces (counting from the left side) to 2; any further commas remain within the second piece.

str_split(symptoms, ",", simplify = T, n = 2)
##      [,1]                 [,2]            
## [1,] "jaundice"           " fever, chills"
## [2,] "chills"             " aches, pains" 
## [3,] "fever"              ""              
## [4,] "vomiting"           " diarrhoea"    
## [5,] "bleeding from gums" " fever"        
## [6,] "rapid pulse"        " headache"

Note - the same outputs can be achieved with str_split_fixed(), in which you do not give the simplify argument, but must instead designate the number of columns (n).

str_split_fixed(symptoms, ",", n = 2)

Splitting a column within a dataframe

Within a dataframe, to split one character column into other columns, use separate() from tidyr.

Assume we have a simple dataframe df consisting of a case ID column, one character column with symptoms, and one outcome column.

First provide the column to be separated, then provide a vector (c()) of new columns names to the argument into =, as shown below. The argument sep = can be a character, or a number (interpreted as the character position to split at).

Optional arguments include remove = (TRUE by default, which drops the input column from the dataframe) and convert = (FALSE by default; if TRUE, new columns are type-converted, which also turns string “NA”s into true NA).

extra = controls what happens if more values are created by the separation than there are new columns named. The default, extra = "warn", means R returns a warning but proceeds, dropping the extra values. extra = "drop" drops them with no warning.

Setting extra = "merge" will only split to the number of new columns listed in into - this setting will preserve all your data.

# third symptoms combined into second new column
df %>% 
  separate(symptoms, into = c("sym_1", "sym_2"), sep=",", extra = "merge")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
##   case_ID              sym_1          sym_2 outcome
## 1       1           jaundice  fever, chills Success
## 2       2             chills   aches, pains Failure
## 3       3              fever           <NA> Failure
## 4       4           vomiting      diarrhoea Success
## 5       5 bleeding from gums          fever Success
## 6       6        rapid pulse       headache Success
# third symptoms are lost
df %>% 
  separate(symptoms, into = c("sym_1", "sym_2", "sym_3"), sep=",")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 4 rows [3, 4, 5, 6].
##   case_ID              sym_1      sym_2   sym_3 outcome
## 1       1           jaundice      fever  chills Success
## 2       2             chills      aches   pains Failure
## 3       3              fever       <NA>    <NA> Failure
## 4       4           vomiting  diarrhoea    <NA> Success
## 5       5 bleeding from gums      fever    <NA> Success
## 6       6        rapid pulse   headache    <NA> Success
# third symptoms given their own column
separated <- df %>% 
  separate(symptoms, into = c("sym_1", "sym_2", "sym_3"), sep=",")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 4 rows [3, 4, 5, 6].
separated
##   case_ID              sym_1      sym_2   sym_3 outcome
## 1       1           jaundice      fever  chills Success
## 2       2             chills      aches   pains Failure
## 3       3              fever       <NA>    <NA> Failure
## 4       4           vomiting  diarrhoea    <NA> Success
## 5       5 bleeding from gums      fever    <NA> Success
## 6       6        rapid pulse   headache    <NA> Success

CAUTION: If you do not provide enough into values for the new columns, your data may be truncated.

One solution to automatically make as many columns as needed could be:
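One sketch (the helper object n_cols below is ours, not part of tidyr) counts the maximum number of separators in any row, names that many columns, and uses fill = "right" to pad shorter rows with NA:

```r
library(tidyr)
library(stringr)

# hypothetical example data, mirroring the symptoms column used above
df <- data.frame(symptoms = c("jaundice, fever, chills", "fever"))

# the most pieces in any one row = most commas + 1
n_cols <- max(str_count(df$symptoms, ",")) + 1

df %>% 
  separate(symptoms, into = paste0("sym_", 1:n_cols), sep = ",", fill = "right")
```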

unite()

Within a dataframe, bringing together multiple columns (the opposite of separate()) can be achieved with unite() from tidyr.

Provide the name of the new united column. Then provide the names of the columns you wish to unite. By default the separator used in the united column is "_", but this can be changed with the sep = argument. Other optional arguments include remove = (TRUE by default, removes the input columns from the data frame) and na.rm = (FALSE by default; if TRUE, missing values are removed before uniting).

Below, we re-unite the dataframe that was separated above.

separated %>% 
  unite(
    col = "all_symptoms",         # name of the new united column
    c("sym_1", "sym_2", "sym_3"), # columns to unite
    sep = ", ",                   # separator to use in united column
    remove = TRUE,                # if TRUE, removes input cols from the data frame
    na.rm = TRUE                  # if TRUE, missing values are removed before uniting
  )
##   case_ID               all_symptoms outcome
## 1       1  jaundice,  fever,  chills Success
## 2       2     chills,  aches,  pains Failure
## 3       3                      fever Failure
## 4       4       vomiting,  diarrhoea Success
## 5       5 bleeding from gums,  fever Success
## 6       6     rapid pulse,  headache Success

Regex groups

Groups within strings

str_match() TBD

Regex and special characters

A regular expression, or “regex”, is a concise language for describing patterns in strings.

Much of this tab is adapted from this tutorial and this cheatsheet

Special characters

Backslash \ as escape

The backslash \ is used to “escape” the meaning of the next character. This way, a backslash can be used to have a quote mark display within other quote marks (\") - the middle quote mark will not “break” the surrounding quote marks.

Note - thus, if you want to display a backslash, you must escape its meaning with another backslash. So you must write two backslashes \\ to display one.

Special characters

Special character Represents
"\\" backslash
"\n" a new line (newline)
"\"" double-quote within double quotes
'\'' single-quote within single quotes
"\`" grave accent
"\r" carriage return
"\t" tab
"\v" vertical tab
"\b" backspace

Run ?"'" in the R Console to display a complete list of these special characters (it will appear in the RStudio Help pane).

Regular expressions (regex)

If you are not familiar with it, a regular expression can look like an alien language:

A regular expression is applied to extract specific patterns from unstructured text - for example medical notes, chief complaints, patient history, or other free text columns in a dataset.

There are four basic tools one can use to create a basic regular expression:

  1. Character sets
  2. Meta characters
  3. Quantifiers
  4. Groups

Character sets

Character sets are a way of listing options for a character match, within brackets. A match is triggered if any of the characters within the brackets are found in the string. For example, to look for vowels one could use this character set: “[aeiou]”. Some other common character sets are:

Character set Matches for
"[A-Z]" any single capital letter
"[a-z]" any single lowercase letter
"[0-9]" any digit
[:alnum:] any alphanumeric character
[:digit:] any numeric digit
[:alpha:] any letter (upper or lowercase)
[:upper:] any uppercase letter
[:lower:] any lowercase letter

Character sets can be combined within one bracket (no spaces!), such as "[A-Za-z]" (any upper or lowercase letter), or another example "[t-z0-5]" (lowercase t through z OR number 0 through 5).
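A brief illustration (the example string is hypothetical):

```r
library(stringr)

str_extract_all("Sample A12-b9", "[A-Za-z]")[[1]]   # every single letter
## [1] "S" "a" "m" "p" "l" "e" "A" "b"
str_extract_all("Sample A12-b9", "[0-9]")[[1]]      # every single digit
## [1] "1" "2" "9"
```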

Meta characters

Meta characters are shorthand for character sets. Some of the important ones are listed below:

Meta character Represents
"\\s" a single space
"\\w" any single alphanumeric character (A-Z, a-z, or 0-9)
"\\d" any single numeric digit (0-9)

Quantifiers

Typically you do not want to search for a match on only one character. Quantifiers allow you to designate the length of letters/numbers to allow for the match.

Quantifiers are numbers written within curly brackets { } after the character they are quantifying, for example,

  • "A{2}" will return instances of two capital A letters.
  • "A{2,4}" will return instances of between two and four capital A letters (do not put spaces!).
  • "A{2,}" will return instances of two or more capital A letters.
  • "A+" will return instances of one or more capital A letters (group extended until a different character is encountered).
  • Append an asterisk * instead to return zero or more matches (useful if you are not sure the pattern is present)

Using the + plus symbol as a quantifier, the match continues until a different character is encountered. For example, this expression will return all words (runs of alpha characters): "[A-Za-z]+"

# test string for quantifiers
test <- "A-AA-AAA-AAAA"

When a quantifier of {2} is used, only pairs of consecutive A’s are returned. Two pairs are identified within AAAA.

str_extract_all(test, "A{2}")
## [[1]]
## [1] "AA" "AA" "AA" "AA"

When a quantifier of {2,4} is used, groups of consecutive A’s that are two to four in length are returned.

str_extract_all(test, "A{2,4}")
## [[1]]
## [1] "AA"   "AAA"  "AAAA"

With the quantifier +, groups of one or more are returned:

str_extract_all(test, "A+")
## [[1]]
## [1] "A"    "AA"   "AAA"  "AAAA"

Relative position

These express requirements for what precedes or follows a pattern (“lookarounds”). For example, the pattern "(?<=\\.)\\s(?=[A-Z])" matches a space that is preceded by a period AND followed by a capital letter - that is, the gap between two sentences.
Position statement Matches to
"(?<=b)a" “a” that is preceded by a “b”
"(?<!b)a" “a” that is NOT preceded by a “b”
"a(?=b)" “a” that is followed by a “b”
"a(?!b)" “a” that is NOT followed by a “b”

Groups

Capturing groups in your regular expression is a way to have a more organized output upon extraction.
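As a brief sketch (the string below is hypothetical): str_match() returns a character matrix with the full match in the first column and each captured group - each parenthesized part of the pattern - in its own column:

```r
library(stringr)

str_match("id: 1234, age: 45", "id: (\\d+), age: (\\d+)")
##      [,1]                [,2]   [,3]
## [1,] "id: 1234, age: 45" "1234" "45"
```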

Regex examples

Below is a free text for the examples. We will try to extract useful information from it using a regular expression search term.

pt_note <- "Patient arrived at Broward Hospital emergency ward at 18:00 on 6/12/2005. Patient presented with radiating abdominal pain from LR quadrant. Patient skin was pale, cool, and clammy. Patient temperature was 99.8 degrees farinheit. Patient pulse rate was 100 bpm and thready. Respiratory rate was 29 per minute."

This expression matches to all words (any character until hitting non-character such as a space):

str_extract_all(pt_note, "[A-Za-z]+")
## [[1]]
##  [1] "Patient"     "arrived"     "at"          "Broward"     "Hospital"    "emergency"   "ward"        "at"          "on"         
## [10] "Patient"     "presented"   "with"        "radiating"   "abdominal"   "pain"        "from"        "LR"          "quadrant"   
## [19] "Patient"     "skin"        "was"         "pale"        "cool"        "and"         "clammy"      "Patient"     "temperature"
## [28] "was"         "degrees"     "farinheit"   "Patient"     "pulse"       "rate"        "was"         "bpm"         "and"        
## [37] "thready"     "Respiratory" "rate"        "was"         "per"         "minute"

The expression "[0-9]{1,2}" matches to consecutive numbers that are 1 or 2 digits in length. It could also be written "\\d{1,2}", or "[:digit:]{1,2}".

str_extract_all(pt_note, "[0-9]{1,2}")
## [[1]]
##  [1] "18" "00" "6"  "12" "20" "05" "99" "8"  "10" "0"  "29"
Note that in regex, an unescaped period "." matches any character. So splitting on "." splits at every single character, leaving only empty strings - escape it as "\\." if you want to split on a literal period:

str_split(pt_note, ".")
## [[1]]
##   [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
##  [44] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
##  [87] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [130] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [173] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [216] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [259] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [302] "" "" "" "" "" "" "" ""

The expression below attempts to match a specific shape of phrase. The pattern reads in English as: "A capital letter, followed by some lowercase letters, a space, a word, a space, one or two digits, a space, a word, and optionally a space and another word". No part of this note matches that full sequence, so an empty character vector is returned:

str_extract_all(pt_note, "[A-Z][a-z]+\\s\\w+\\s\\d{1,2}\\s\\w+\\s*\\w*")
## [[1]]
## character(0)

You can view a useful list of regex expressions and tips on page 2 of this cheatsheet

Also see this tutorial.

Resources

A reference sheet for stringr functions can be found here

A vignette on stringr can be found here

De-duplication

Overview

This page covers the following subjects:

  1. Identifying and removing duplicate rows
  2. “Slicing” and keeping only certain rows (min, max, random…), also from each group
  3. “Rolling-up”, or combining values from multiple rows into one

Preparation

Load packages

pacman::p_load(tidyverse,   # deduplication, grouping, and slicing functions
               janitor,     # function for reviewing duplicates
               stringr      # for string searches, can be used in "rolling-up" values
               )     

Example dataset

For demonstration, we will use the fake dataset below. It is a record of COVID-19 phone encounters, including with contacts and with cases.

  • The first two records are 100% complete duplicates including duplicate recordID (computer glitch)
  • The second two rows are duplicates, in all columns except for recordID
  • Several people had multiple phone encounters, at various dates/times and as contacts or cases
  • At each encounter, the person was asked if they had ever had symptoms, and some of this information is missing.

Deduplication

This tab uses the dataset from the Preparation tab to describe how to review and remove duplicate rows in a dataframe. It also shows how to handle duplicate elements in a vector.

Examine duplicate rows

To quickly review rows that have duplicates, you can use get_dupes() from the janitor package. By default, all columns are considered when duplicates are evaluated - rows returned are 100% duplicates considering the values in all columns.

In the obs dataframe, the first two rows are 100% duplicates - they have the same value in every column (including the recordID column, which is supposed to be unique - it must be some computer glitch). The returned dataframe automatically includes a new column dupe_count, showing the number of rows with that combination of duplicate values.

# 100% duplicates across all columns
obs %>% 
  janitor::get_dupes()

However, if we choose to ignore recordID, the 3rd and 4th rows are also duplicates. That is, they have the same values in all columns except for recordID. You can specify specific columns to be ignored in the function using a - minus symbol.

# Duplicates when column recordID is not considered
obs %>% 
  janitor::get_dupes(-recordID)         # if multiple columns, wrap them in c()

You can also positively specify the columns to consider. Below, only rows that have the same values in the name and purpose columns are returned. Notice how “amrish” now has dupe_count equal to 3 to reflect his three “contact” encounters.

Scroll to the right to see more rows

# duplicates based on name and purpose columns ONLY
obs %>% 
  janitor::get_dupes(name, purpose)

See ?get_dupes for more details, or see this online reference

Keep only unique rows

To keep only unique rows of a dataframe, use distinct() from dplyr. Rows that are duplicates are removed such that only the first of such rows is kept. By default, “first” means the lowest rownumber (the top-most of the duplicate rows, reading top-to-bottom). Only unique rows are kept. In the example below, one duplicate row (for “adam”) has been removed (n is now 18, not 19 rows).

Scroll to the left to see the entire dataframe

# added to a chain of pipes (e.g. data cleaning)
obs %>% 
  distinct(across(-recordID), # reduces dataframe to only unique rows (keeps first one of any duplicates)
           .keep_all = TRUE) 

# if outside pipes, include the data as first argument 
# distinct(obs)

CAUTION: If using distinct() on grouped data, the function will apply to each group.

Deduplicate based on specific columns

You can also specify columns to be the basis for de-duplication. In this way, the de-duplication only applies to rows that are duplicates within the specified columns. Unless specified with .keep_all = TRUE, all columns not mentioned will be dropped.

In the example below, the de-duplication only applies to rows that have identical values for name and purpose columns. Thus, “brian” has only 2 rows instead of 3 - his first “contact” encounter and his only “case” encounter. To adjust so that brian’s latest encounter of each purpose is kept, see the tab on Slicing within groups.

Scroll to the left to see the entire dataframe

# added to a chain of pipes (e.g. data cleaning)
obs %>% 
  distinct(name, purpose, .keep_all = TRUE) %>%  # keep rows unique by name and purpose, retain all columns
  arrange(name)                                  # arrange for easier viewing

Duplicate elements in a vector

The function duplicated() from base R will evaluate a vector (column) and return a logical vector of the same length (TRUE/FALSE). The first time a value appears, it will return FALSE (not a duplicate), and subsequent times that value appears it will return TRUE. Note how NA is treated the same as any other value.

x <- c(1, 1, 2, NA, NA, 4, 5, 4, 4, 1, 2)
duplicated(x)
##  [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

To return only the duplicated elements, you can use brackets to subset the original vector:

x[duplicated(x)]
## [1]  1 NA  4  4  1  2

To return only the unique elements, use unique() from base R. To remove NAs from the output, nest na.omit() within unique().

unique(x)           # alternatively, use x[!duplicated(x)]
## [1]  1  2 NA  4  5
unique(na.omit(x))  # remove NAs 
## [1] 1 2 4 5

with base R

To return duplicate rows

In base R, you can also see which rows are 100% duplicates in a dataframe df with the command duplicated(df) (returns a logical vector of the rows).

Thus, you can also use the base subset [ ] on the dataframe to see the duplicated rows with df[duplicated(df),] (don’t forget the comma, meaning that you want to see all columns!).

To return unique rows

See the notes above. To see the unique rows you add the logical negator ! in front of the duplicated() function:
df[!duplicated(df),]

To return rows that are duplicates of only certain columns

Subset the df that is within the duplicated() parentheses, so this function will operate on only certain columns of the df.

To specify the columns, provide column numbers or names after a comma (remember, all this is within the duplicated() function).

Be sure to keep the comma , outside after the duplicated() function as well!

For example, to evaluate only columns 2 through 5 for duplicates: df[!duplicated(df[, 2:5]),]
To evaluate only columns name and purpose for duplicates: df[!duplicated(df[, c("name", "purpose")]),]

Slicing

To “slice” a dataframe is useful in de-duplication if you have multiple rows per functional group (e.g. per “person”) and you only want to analyze one or some of them. Think of slicing as a filter on the rows, by row number/position.

The basic slice() function accepts a number n. If positive, only the nth row is returned. If negative, all rows except the nth are returned.

Variations include:

  • slice_min() and slice_max() - to keep only the row(s) with the minimum or maximum value of the specified column. These also work with ordered factors.
  • slice_head() and slice_tail() - to keep only the first or last row(s)
  • slice_sample() - to keep only a random sample of the rows

Use arguments n = or prop = to specify the number or proportion of rows to keep. If not using the function in a pipe chain, provide the data argument first (e.g. slice(df, n = 2)). See ?slice for more information.

Other arguments:

order_by = - used in slice_min() and slice_max(), this is the column to order by before slicing.
with_ties = - TRUE by default, meaning ties are kept.
.preserve = - FALSE by default. If TRUE then the grouping structure is re-calculated after slicing.
weight_by = - Optional, numeric column to weight by (bigger number more likely to get sampled). Also replace = for whether sampling is done with/without replacement.
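For example, a quick sketch of slice_sample() with a hypothetical dataframe (output not shown because the rows are chosen at random):

```r
library(dplyr)

df <- data.frame(id = 1:10)

slice_sample(df, n = 3)       # keep 3 random rows
slice_sample(df, prop = 0.2)  # keep a random 20% of rows
```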

TIP: When using slice_max() and slice_min(), be sure to specify/write the n = (e.g. n = 2, not just 2). Otherwise you may get an error such as Error: `...` is not empty.

NOTE: You may encounter the function top_n(), which has been superseded by the slice functions.

Here, the basic slice() function is used to keep only the 4th row:

obs %>% 
  slice(4)  # keeps the 4th row only

Slice with groups

The slice_*() functions can be very useful if applied to a grouped dataframe, as the slice operation is performed on each group separately. Use the function group_by() in conjunction with slice() to group the data and then take a slice from each group.
This is helpful for de-duplication if you have multiple rows per person but only want to keep one of them. You first use group_by() with key columns that are the same, and then use a slice function on a column that will differ among the grouped rows.

In the example below, to keep only the latest encounter per person, we group the rows by name and then use slice_max() with n = 1 on the date column. Be aware! To apply a function like slice_max() on dates, the date column must be class Date.

By default, “ties” (e.g. same date in this scenario) are kept, and we would still get multiple rows for some people (e.g. adam). To avoid this we set with_ties = FALSE. We get back only one row per person.

CAUTION: If using arrange(), specify .by_group = TRUE to have the data arranged within each group.

DANGER: If with_ties = FALSE, the first row of a tie is kept. This may be deceptive. See how for Mariah, she has two encounters on her latest date (6 Jan) and the first (earliest) one was kept. Likely, we want to keep her later encounter on that day. See how to “break” these ties in the next example.

obs %>% 
  group_by(name) %>%       # group the rows by 'name'
  slice_max(date,          # keep row per group with maximum date value 
            n = 1,         # keep only the single highest row 
            with_ties = F) # if there's a tie (of date), take the first row

Breaking “ties”

Multiple slice statements can be run to “break ties”. In this case, if a person has multiple encounters on their latest date, the encounter with the latest time is kept (lubridate::hm() is used to convert the character times to a sortable time class).
Note how now, the one row kept for “Mariah” on 6 Jan is encounter 3 from 08:32, not encounter 2 at 07:25.

# Example of multiple slice statements to "break ties"
obs %>%
  group_by(name) %>%
  
  # FIRST - slice by latest date
  slice_max(date, n = 1, with_ties = TRUE) %>% 
  
  # SECOND - if there is a tie, select row with latest time; ties prohibited
  slice_max(lubridate::hm(time), n = 1, with_ties = FALSE)

In the example above, it would also have been possible to slice by encounter number, but we showed the slice on date and time for example purposes.

TIP: To use slice_max() or slice_min() on a “character” column, mutate it to an ordered factor class!
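For example, a sketch with a hypothetical character severity column:

```r
library(dplyr)

# hypothetical data: slice_max() on the raw character column would
# rank the values alphabetically, not by clinical meaning
visits <- tibble::tibble(
  name     = c("adam", "adam", "zara"),
  severity = c("mild", "severe", "moderate"))

# convert to an ordered factor so slice_max() ranks by the defined order
visits_max <- visits %>%
  mutate(severity = factor(severity,
                           levels  = c("mild", "moderate", "severe"),
                           ordered = TRUE)) %>%
  group_by(name) %>%
  slice_max(severity, n = 1, with_ties = FALSE)
```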

Keep all but mark them

If you want to keep all records but mark only some for analysis, consider a two-step approach utilizing a unique recordID/encounter number:

  1. Reduce/slice the original dataframe to only the rows for analysis. Save/retain this reduced dataframe.
  2. In the original dataframe, mark rows as appropriate with case_when(), based on whether their record unique identifier (recordID in this example) is present in the reduced dataframe.
# 1. Define dataframe of rows to keep for analysis
obs_keep <- obs %>%
  group_by(name) %>%
  slice_max(encounter, n = 1, with_ties = FALSE) # keep only latest encounter per person


# 2. Mark original dataframe
obs_marked <- obs %>%

  # make new dup_record column
  mutate(dup_record = case_when(
    
    # if record is in obs_keep dataframe
    recordID %in% obs_keep$recordID ~ "For analysis", 
    
    # all else marked as "Ignore" for analysis purposes
    TRUE                            ~ "Ignore"))

# print
obs_marked

Calculate row completeness

Create a column that contains a metric for the row’s completeness (non-missingness). This could be helpful when deciding which rows to prioritize over others when de-duplicating/slicing.

In this example, “key” columns over which you want to measure completeness are saved in a vector of column names.

Then the new column key_completeness is created with mutate(). The new value in each row is defined as a calculated fraction: the number of non-missing values in that row among the key columns, divided by the number of key columns.

This involves the function rowSums() from base R. Also used is ., which within piping refers to the dataframe at that point in the pipe (in this case, it is being subset with brackets []).

Scroll to the right to see more rows

# create a "key variable completeness" column
# this is a *proportion* of the columns designated as "key_vars" that have non-missing values

key_cols = c("personID", "name", "symptoms_ever")

obs %>% 
  mutate(key_completeness = rowSums(!is.na(.[,key_cols]))/length(key_cols)) 

Roll-up values

This tab describes:

  1. How to “roll-up” values from multiple rows into just one row, with some variations
  2. Once you have “rolled-up” values, how to overwrite/prioritize the values in each cell

This tab uses the example dataset from the Preparation tab.

Roll-up values into one row

The code example below uses group_by() and summarise() to group rows by person, and then paste together all values within the grouped rows. Thus, you get one summary row per person. A few notes:

  • A suffix can be appended to all new columns ("_roll" in the third variation below)
  • If you want to show only unique values per cell, wrap na.omit() with unique()
  • na.omit() removes NA values; if this is not desired, remove it from the paste0() command (e.g. paste0(.x, collapse = "; "))

Scroll to the right to see more columns

# "Roll-up" values into one row per group (per "personID") 
cases_rolled <- obs %>% 
  
  # create groups by name
  group_by(personID) %>% 
  
  # order the rows within each group (e.g. by date)
  arrange(date, .by_group = TRUE) %>% 
  
  # For each column, paste together all values within the grouped rows, separated by ";"
  summarise(
    across(everything(),                           # apply to all columns
           ~paste0(na.omit(.x), collapse = "; "))) # function is defined which combines non-NA values

The result is one row per group (ID), with entries arranged by date and pasted together.

This variation shows unique values only:

# Variation - show unique values only 
cases_rolled <- obs %>% 
  group_by(personID) %>% 
  arrange(date, .by_group = TRUE) %>% 
  summarise(
    across(everything(),                                   # apply to all columns
           ~paste0(unique(na.omit(.x)), collapse = "; "))) # function is defined which combines unique non-NA values

This variation appends a suffix to each column.
In this case "_roll" to signify that it has been rolled:

# Variation - suffix added to column names 
cases_rolled <- obs %>% 
  group_by(personID) %>% 
  arrange(date, .by_group = TRUE) %>% 
  summarise(
    across(everything(),                
           list(roll = ~paste0(na.omit(.x), collapse = "; ")))) # _roll is appended to column names
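As mentioned in the notes above, na.omit() can be removed if you want missing values included in the pasted string (they will appear as the literal text "NA"). A sketch of that variation:

```r
# Variation - NA values kept in the pasted string (appear as "NA")
cases_rolled <- obs %>% 
  group_by(personID) %>% 
  arrange(date, .by_group = TRUE) %>% 
  summarise(
    across(everything(),
           ~paste0(.x, collapse = "; ")))  # no na.omit(), so NAs are pasted too
```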

Overwrite values/hierarchy

If you then want to evaluate all of the rolled values and keep only a specific one (e.g. the “best” or “maximum” value), you can use mutate() across the desired columns to implement case_when(), which uses str_detect() from the stringr package to sequentially look for string patterns and overwrite the cell contents.

# CLEAN CASES
#############
cases_clean <- cases_rolled %>% 
    
    # clean Yes-No-Unknown vars: replace text with "highest" value present in the string
    mutate(across(c(contains("symptoms_ever")),                     # operates on specified columns (Y/N/U)
             list(mod = ~case_when(                                 # adds suffix "_mod" to new cols; implements case_when()
               
               str_detect(.x, "Yes")       ~ "Yes",                 # if "Yes" is detected, then cell value converts to yes
               str_detect(.x, "No")        ~ "No",                  # then, if "No" is detected, then cell value converts to no
               str_detect(.x, "Unknown")   ~ "Unknown",             # then, if "Unknown" is detected, then cell value converts to Unknown
               TRUE                        ~ as.character(.x)))),   # otherwise, the value is kept as-is
      .keep = "unused")                                             # old columns removed, leaving only _mod columns

Now you can see in the column symptoms_ever that if the person EVER said “Yes” to symptoms, then only “Yes” is displayed.

Resources

Much of the information in this page is adapted from these resources and vignettes online:

datanovia

dplyr tidyverse reference

cran janitor vignette

III Data Visualization

Age pyramids

Age pyramids can be useful to show patterns by age group. They can show gender, or the distribution of other characteristics.
These tabs demonstrate how to produce age pyramids using:

  • Fast & easy: Using the apyramid package
  • More flexible: Using ggplot()
  • Having baseline demographics displayed in the background of the pyramid
  • Using pyramid-style plots to show other types of data (e.g responses to Likert-style questions)

Overview

Age/gender demographic pyramids in R are generally made with ggplot() by creating two barplots (one for each gender), converting one’s values to negative values, and flipping the x and y axes to display the barplots vertically.

In this page we cover:

  • A quick approach using the apyramid package
  • More customizable code using raw ggplot() commands
  • How to combine case demographic data and compare it with a baseline population
  • Application of these methods to show other types of data (e.g. responses to Likert-style survey questions)

Preparation

For this tab we use the linelist dataset that is cleaned in the Cleaning tab.

To make a traditional age/sex demographic pyramid, the data must first be cleaned in the following ways:

  • The gender column must be cleaned.
  • Age should be in an age category column of class Factor (with correctly ordered levels)
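If the age category column is still class character, it can be converted like this (a minimal sketch with hypothetical values; see the Cleaning data page for the full workflow):

```r
library(dplyr)

# hypothetical mini linelist with a character age category column
df <- tibble::tibble(age_cat5 = c("5-9", "0-4", "15-19", "10-14"))

# convert to a factor with explicitly ordered levels, so pyramid bars
# appear in age order rather than alphabetical order
df <- df %>%
  mutate(age_cat5 = factor(age_cat5,
                           levels = c("0-4", "5-9", "10-14", "15-19")))
```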

Load packages

First, load the packages required for this analysis:

pacman::p_load(rio,       # to import data
               here,      # to locate files
               tidyverse, # to clean, handle, and plot the data (includes ggplot2 package)
               apyramid,  # a package dedicated to creating age pyramids
               stringr)   # working with strings for titles, captions, etc.

Load the data

linelist <- rio::import("linelist_cleaned.csv")

Check class of variables

Ensure that the age variable is class Numeric, and check the class and order of levels of age_cat and age_cat5.

class(linelist$age_years)
## [1] "numeric"
class(linelist$age_cat)
## [1] "factor"
class(linelist$age_cat5)
## [1] "factor"
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA> 
##  1081  1148   971   837  1091   628    45     0    88
table(linelist$age_cat5, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84   85+  <NA> 
##  1081  1148   971   837   600   491   295   181    98    54    26    14     2     3     0     0     0     0    88

apyramid package

The package apyramid allows you to quickly make an age pyramid. For more nuanced situations, see the tab on using ggplot() to make age pyramids. You can read more about the apyramid package in its Help page by entering ?age_pyramid in your R console.

Linelist data

Using the cleaned linelist dataset, we can create an age pyramid with just one simple command. If you need help cleaning your data, see the handbook page on Cleaning data (LINK). In this command:

  • The data argument is set as the linelist dataframe
  • The age_group argument is set to the name (in quotes) of the categorical age column (in this case age_cat5)
  • The split_by argument (bar colors) should be a binary column (in this case “gender”)
apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender")
## Warning: 283 missing rows were removed (88 values from `age_cat5` and 283 values from `gender`).

When using the apyramid package, if the split_by column is binary (e.g. male/female, or yes/no), the result will appear as a pyramid. However, if there are more than two values in the split_by column (not including NA), the pyramid will appear as a faceted barplot, with empty bars in the background indicating the range of the un-faceted data for that age group. Values of split_by will appear as labels at the top of each facet. For example, below the split_by variable is “hospital”.

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "hospital",
                      na.rm = FALSE)        # show a bar for patients missing age

Missing values
Rows missing values for the split_by or age_group columns, if coded as NA, will not trigger the faceting shown above. By default these rows will not be shown. However you can specify that they appear, in an adjacent barplot and as a separate age group at the top, by specifying na.rm = FALSE.

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender",
                      na.rm = FALSE)         # show patients missing age or gender

Proportions, colors, & aesthetics

By default, the bars display counts (not %), a dashed mid-line for each group is shown, and the colors are green/purple. Each of these parameters can be adjusted, as shown below:

You can also add additional ggplot() commands to the plot using the standard ggplot() “+” syntax, such as aesthetic themes and label adjustments:

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender",
                      proportional = TRUE,                  # show percents, not counts
                      show_midpoint = FALSE,                # remove bar mid-point line
                      #pal = c("orange", "purple")          # can specify alt. colors here (but not labels, see below)
                      )+                 
  
  # additional ggplot commands
  theme_minimal()+                                          # simplify the background
  scale_fill_manual(values = c("orange", "purple"),         # to specify colors AND labels
                     labels = c("Male", "Female"))+
  labs(y = "Percent of all cases",                          # note that x and y labels are switched (see ggplot tab)
       x = "Age categories",                          
       fill = "Gender", 
       caption = "My data source and caption here",
       title = "Title of my plot",
       subtitle = "Subtitle with \n a second line...")+
  theme(
    legend.position = "bottom",                             # move legend to bottom
    axis.text = element_text(size = 10, face = "bold"),     # fonts/sizes, see ggplot tips page
    axis.title = element_text(size = 12, face = "bold"))
## Warning: 283 missing rows were removed (88 values from `age_cat5` and 283 values from `gender`).
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.

Aggregated data

The examples above assume your data are in a linelist-like format, with one row per observation. If your data are already aggregated into counts by age category, you can still use the apyramid package, as shown below.

Let’s say that your dataset looks like this, with columns for age category, male counts, female counts, and missing counts.
(see the handbook page on Transforming data for tips)

# View the aggregated data
DT::datatable(demo_agg, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

ggplot() prefers data in “long” format, so first pivot the data “long” with the pivot_longer() function from tidyr.

# pivot the aggregated data into long format
demo_agg_long <- demo_agg %>% 
  pivot_longer(c(f, m, missing_gender),            # cols to elongate
               names_to = "gender",                # name for new col of categories
               values_to = "counts") %>%           # name for new col of counts
  mutate(gender = na_if(gender, "missing_gender")) # convert "missing_gender" to NA
# View the long-format data
DT::datatable(demo_agg_long, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Then use the split_by and count arguments of age_pyramid() to specify the respective columns:

apyramid::age_pyramid(data = demo_agg_long,
                      age_group = "age_cat5",
                      split_by = "gender",
                      count = "counts")      # give the column name for the aggregated counts
## Warning: Removed 19 rows containing missing values (position_stack).
## Warning: Removed 19 rows containing missing values.

Note in the above that the factor order of “m” and “f” is different (pyramid reversed). To adjust the order, re-define gender in the aggregated data as a Factor and order the levels as desired.
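For example, a sketch using fct_relevel() from the forcats package (loaded with tidyverse) to put "m" before "f" in the factor levels:

```r
library(dplyr)

# re-define gender as a factor with "m" as the first level,
# swapping the pyramid sides relative to the default alphabetical order
demo_agg_long <- demo_agg_long %>%
  mutate(gender = forcats::fct_relevel(gender, "m", "f"))

# then re-run age_pyramid() on the re-leveled data
```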

ggplot()

Using ggplot() to build your age pyramid allows for more flexibility, but requires more effort and understanding of how ggplot() works. It is also easier to accidentally make mistakes.

apyramid uses ggplot() in the background (and accepts ggplot() commands added), but this page shows how to adjust or recreate a pyramid only using ggplot(), if you wish.

Constructing the plot

First, understand that to make such a pyramid using ggplot() the approach is to:

  • Within the ggplot(), create two graphs by age category, one for each of the two grouping values (in this case gender). See the filters applied to the data argument in each geom_histogram() command below.

  • If using geom_histogram(), the graphs operate off the numeric column (e.g. age_years), whereas if using geom_bar() the graphs operate from an ordered Factor (e.g. age_cat5).

  • One graph will have positive count values, while the other will have its counts converted to negative values - this allows both graphs to be seen and compared against each other in the same plot.

  • The command coord_flip() switches the X and Y axes, resulting in the graphs turning vertical and creating the pyramid.

  • Lastly, the counts-axis labels must be specified so they appear as “positive” counts on both sides of the pyramid (despite the underlying values on one side being negative).

A simple version of this, using geom_histogram(), is below:

  # begin ggplot
  ggplot(data = linelist, aes(x = age, fill = gender)) +
  
  # female histogram
  geom_histogram(data = filter(linelist, gender == "f"),
                 breaks = seq(0,85,5),
                 colour = "white") +
  
  # male histogram (values converted to negative)
  geom_histogram(data = filter(linelist, gender == "m"),
                 breaks = seq(0,85,5),
                 aes(y=..count..*(-1)),
                 colour = "white") +
  
  # flip the X and Y axes
  coord_flip() +
  
  # adjust counts-axis scale
  scale_y_continuous(limits = c(-600, 900),
                     breaks = seq(-600,900,100),
                     labels = abs(seq(-600, 900, 100)))

DANGER: If the limits of your counts axis are set too low, and a counts bar exceeds them, the bar will disappear entirely or be artificially shortened! Watch for this if analyzing data which is routinely updated. Prevent it by having your count-axis limits auto-adjust to your data, as below.

There are many things you can change/add to this simple version, including:

  • Auto-adjust the counts-axis scale to your data (avoiding the errors discussed in the warning above)
  • Manually specify colors and legend labels
# create dataset with proportion of total
pyramid_data <- linelist %>%
  group_by(age_cat5, gender) %>% 
  summarize(counts = n()) %>% 
  ungroup() %>% 
  mutate(percent = round(100*(counts / sum(counts, na.rm=T)),1), 
         percent = case_when(
            gender == "f" ~ percent,
            gender == "m" ~ -percent,
            TRUE          ~ NA_real_))
## `summarise()` has grouped output by 'age_cat5'. You can override using the `.groups` argument.
max_per <- max(pyramid_data$percent, na.rm=T)
min_per <- min(pyramid_data$percent, na.rm=T)


# begin ggplot
  ggplot()+  # x-axis will be the age categories (age_cat5)

  # case data graph
  geom_bar(data = pyramid_data,
           stat = "identity",
           aes(x = age_cat5,
               y = percent,
               fill = gender),        # fill bars by gender
           colour = "white")+         # white around each bar
  
  # flip the X and Y axes to make pyramid vertical
  coord_flip()+
  

  # adjust the axes scales (remember they are flipped now!)
  #scale_x_continuous(breaks = seq(0,100,5), labels = seq(0,100,5)) +
  scale_y_continuous(limits = c(min_per, max_per),
                     breaks = seq(floor(min_per), ceiling(max_per), 2),
                     labels = paste0(abs(seq(floor(min_per), ceiling(max_per), 2)), "%"))+

  # designate colors and legend labels manually
  scale_fill_manual(
    values = c("f" = "orange",
               "m" = "darkgreen"),
    labels = c("Female", "Male"),
  ) +
  
  # label values (remember X and Y flipped now)
  labs(
    x = "Age group",
    y = "Percent of total",
    fill = NULL,
    caption = stringr::str_glue("Data are from linelist \nn = {nrow(linelist)} (age or sex missing for {sum(is.na(linelist$gender) | is.na(linelist$age_years))} cases) \nData as of: {format(Sys.Date(), '%d %b %Y')}")) +
  
  # optional aesthetic themes
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black"),
    plot.title = element_text(hjust = 0.5), 
    plot.caption = element_text(hjust=0, size=11, face = "italic")) + 
  
  ggtitle(paste0("Age and gender of cases"))
## Warning: Removed 12 rows containing missing values (position_stack).

Compare to baseline

With the flexibility of ggplot(), you can have a second layer of bars in the background that represent the true population pyramid. This can provide a nice visualization to compare the observed counts with the baseline.

Import and view the population data

# import the population demographics data
pop <- rio::import("country_demographics.csv")
# display the population data as a table
DT::datatable(pop, rownames = FALSE, filter="top", options = list(pageLength = 10, scrollX=T) )

First some data management steps:

Here we record the order of age categories that we want to appear. Due to some quirks in the way ggplot() is implemented, it is easiest to store these as a character vector and use them later in the plotting function.

# record correct age cat levels
age_levels <- c("0-4","5-9", "10-14", "15-19", "20-24",
                "25-29","30-34", "35-39", "40-44", "45-49",
                "50-54", "55-59", "60-64", "65-69", "70-74",
                "75-79", "80-84", "85+")

Combine the population and case data through the dplyr function bind_rows():

  • First, ensure they have the exact same column names, age categories values, and gender values
  • Make them have the same data structure: columns of age category, gender, counts, and percent of total
  • Bind them together, one on-top of the other (bind_rows())
# create/transform population data, with percent of total
########################################################
pop_data <- pivot_longer(pop, c(m, f), names_to = "gender", values_to = "counts") %>% # pivot gender columns longer
  mutate(data = "population",                                                         # add column designating data source
         percent  = round(100*(counts / sum(counts, na.rm=T)),1),                     # calculate % of total
         percent  = case_when(                                                        # if male, convert % to negative
                            gender == "f" ~ percent,
                            gender == "m" ~ -percent,
                            TRUE          ~ NA_real_))

Review the changed population dataset

# display the transformed population data as a table
DT::datatable(pop_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Now implement the same for the case linelist. This is slightly different because it begins with case rows, not counts.

# create case data by age/gender, with percent of total
#######################################################
case_data <- linelist %>%
  group_by(age_cat5, gender) %>%  # aggregate linelist cases into age-gender groups
  summarize(counts = n()) %>%     # calculate counts per age-gender group
  ungroup() %>% 
  mutate(data = "cases",                                          # add column designating data source
         percent = round(100*(counts / sum(counts, na.rm=T)),1),  # calculate % of total for age-gender groups
         percent = case_when(                                     # convert % to negative if male
            gender == "f" ~ percent,
            gender == "m" ~ -percent,
            TRUE          ~ NA_real_))
## `summarise()` has grouped output by 'age_cat5'. You can override using the `.groups` argument.

Review the changed case dataset

# display the case data as a table
DT::datatable(case_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Now the two datasets are combined, one on top of the other (same column names)

# combine case and population data (same column names, age_cat values, and gender values)
pyramid_data <- bind_rows(case_data, pop_data)

Store the maximum and minimum percent values, used in the plotting function to define the extent of the plot (and not cut off any bars!)

# Define extent of percent axis, used for plot limits
max_per <- max(pyramid_data$percent, na.rm=T)
min_per <- min(pyramid_data$percent, na.rm=T)

Now the plot is made with ggplot():

  • One bar graph of population data (wider, more transparent bars)
  • One bar graph of case data (small, more solid bars)
# begin ggplot
##############
ggplot()+  # x-axis will be the age categories (age_cat5)

  # population data graph
  geom_bar(data = filter(pyramid_data, data == "population"),
           stat = "identity",
           aes(x = age_cat5,
               y = percent,
               fill = gender),        
           colour = "black",                               # black color around bars
           alpha = 0.2,                                    # more transparent
           width = 1)+                                     # full width
  
  # case data graph
  geom_bar(data = filter(pyramid_data, data == "cases"), 
           stat = "identity",                              # use % as given in data, not counting rows
           aes(x = age_cat5,                               # age categories as original X axis
               y = percent,                                # % as original Y-axis
               fill = gender),                             # fill of bars by gender
           colour = "black",                               # black color around bars
           alpha = 1,                                      # not transparent 
           width = 0.3)+                                   # narrower bars
  
  # flip the X and Y axes to make pyramid vertical
  coord_flip()+
  
  # adjust axes order, scale, and labels (remember X and Y axes are flipped now)
  # manually ensure that age-axis is ordered correctly
  scale_x_discrete(limits = age_levels)+ 
  
  # set percent-axis 
  scale_y_continuous(limits = c(min_per, max_per),                                          # min and max defined above
                     breaks = seq(floor(min_per), ceiling(max_per), by = 2),                # from min% to max% by 2 
                     labels = paste0(                                                       # for the labels, paste together... 
                       abs(seq(floor(min_per), ceiling(max_per), by = 2)),                  # ...rounded absolute values of breaks... 
                       "%"))+                                                               # ... with "%"
                                                                                            # floor(), ceiling() round down and up 

  # designate colors and legend labels manually
  scale_fill_manual(
    values = c("f" = "orange",         # assign colors to values in the data
               "m" = "darkgreen"),
    labels = c("f" = "Female",
               "m"= "Male"),      # change labels that appear in legend, note order
  ) +

  # plot labels, titles, caption    
  labs(
    title = "Case age and gender distribution,\nas compared to baseline population",
    subtitle = "",
    x = "Age category",
    y = "Percent of total",
    fill = NULL,
    caption = stringr::str_glue("Cases shown on top of country demographic baseline\nCase data are from linelist, n = {nrow(linelist)}\nAge or gender missing for {sum(is.na(linelist$gender) | is.na(linelist$age_years))} cases\nCase data as of: {format(max(linelist$date_onset, na.rm=T), '%d %b %Y')}")) +
  
  # optional aesthetic themes
  theme(
    legend.position = "bottom",                             # move legend to bottom
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black"),
    plot.title = element_text(hjust = 0), 
    plot.caption = element_text(hjust=0, size=11, face = "italic"))
## Warning: Removed 12 rows containing missing values (position_stack).

Likert scale

The techniques used to make a population pyramid with ggplot() can also be used to make plots of Likert-scale survey data.

Import the data

# import the likert survey response data
likert_data <- rio::import("likert_data.csv")

Start with data that looks like this, with a categorical classification of each respondent (status) and their answers to 8 questions on a 4-point Likert-type scale (“Very Poor”, “Poor”, “Good”, “Very Good”).

# display the likert data as a table
DT::datatable(likert_data, rownames = FALSE, filter="top", options = list(pageLength = 10, scrollX=T) )

First, some data management steps:

  • Pivot the data longer
  • Create new column direction depending on whether response was generally “positive” or “negative”
  • Set the Factor level order for the status column and the Response column
  • Store the max count value so limits of plot are appropriate
melted <- pivot_longer(likert_data, Q1:Q8, names_to = "Question", values_to = "Response") %>% 
     mutate(direction = case_when(
               Response %in% c("Poor","Very Poor") ~ "Negative",
               Response %in% c("Good", "Very Good") ~ "Positive",
               TRUE ~ "Unknown"),
            status = factor(status, levels = rev(c(
                 "Senior", "Intermediate", "Junior"))),
            Response = factor(Response, levels = c("Very Good", "Good",
                                             "Very Poor", "Poor"))) # must reverse Very Poor and Poor for ordering to work

melted_max <- melted %>% 
   group_by(status, Question) %>% 
   summarize(n = n())
## `summarise()` has grouped output by 'status'. You can override using the `.groups` argument.
melted_max <- max(melted_max$n, na.rm=T)

Now make the plot:

# make plot
ggplot()+
     # bar graph of the "negative" responses 
     geom_bar(data = filter(melted,
                            direction == "Negative"), 
              aes(x = status,
                        y = ..count..*(-1),  # counts inverted to negative
                        fill = Response),
                    color = "black",
                    position = "stack")+
     
     # bar graph of the "positive" responses
     geom_bar(data = filter(melted, direction == "Positive"),
              aes(x = status, fill = Response),
              colour = "black",
              position = "stack")+
     
     # flip the X and Y axes
     coord_flip()+
  
     # Black vertical line at 0
     geom_hline(yintercept = 0, color = "black", size=1)+
     
    # convert labels to all positive numbers
    scale_y_continuous(limits = c(-ceiling(melted_max/10)*11, ceiling(melted_max/10)*10),   # limits rounded outward to the nearest 10, with extra room on the negative side
                       breaks = seq(-ceiling(melted_max/10)*10, ceiling(melted_max/10)*10, 10),
                       labels = abs(unique(c(seq(-ceiling(melted_max/10)*10, 0, 10),
                                            seq(0, ceiling(melted_max/10)*10, 10))))) +
     
    # color scales manually assigned 
    scale_fill_manual(values = c("Very Good"  = "green4", # assigns colors
                                  "Good"      = "green3",
                                  "Poor"      = "yellow",
                                  "Very Poor" = "red3"),
                       breaks = c("Very Good", "Good", "Poor", "Very Poor"))+ # orders the legend
     
    
     
    # facet the entire plot so each question is a sub-plot
    facet_wrap(~Question, ncol = 3)+
     
    # labels, titles, caption
    labs(x = "Respondent status",
          y = "Number of responses",
          fill = "")+
     ggtitle(str_glue("Likert-style responses\nn = {nrow(likert_data)}"))+

     # aesthetic settings
     theme_minimal()+
     theme(axis.text = element_text(size = 12),
           axis.title = element_text(size = 14, face = "bold"),
           strip.text = element_text(size = 14, face = "bold"),  # facet sub-titles
           plot.title = element_text(size = 20, face = "bold"),
           panel.background = element_rect(fill = NA, color = "black")) # black box around each facet

Resources

Links to other online tutorials and resources.